to block Krylov subspaces. Closely related random projection methods are compared to block Krylov
subspaces, and new hybrid approaches are developed. Hybrid random-projection Krylov subspace
methods offer faster compute times than random projection methods, and lower approximation errors
when sufficient conditions are met. A novel adaptively-blocked Krylov subspace algorithm is
developed that offers superior compute times to random projection methods. Stationary inner
iteration is considered for accelerating convergence of Krylov subspaces and applied to the low-rank
approximation problem; a generalization of eigenvalue approximation bounds is presented for Krylov
subspaces augmented with inner iteration. All aforementioned methods are evaluated in terms of
floating-point operations and applied to numerous problems.
Chapter 1


Introduction 


The  dimension  reduction  problem  presents  itself  in  a  wide  variety  of  domains,  from  engineering  to 
biology  to  economics.  Though  the  domains  and  context  of  the  reduction  problems  may  be  disparate, 
the  presentations  of  the  problem  are  similar.  In  all  cases,  one  is  faced  with  data  —  possibly  sparse 
—  that  lies  in  a  high-dimensional  space.  A  powerful  and  popular  method  to  reduce  the  dimension  of 
the  data  is  to  project  it  into  a  low-dimensional  space  spanned  by  the  leading  singular  vectors  of  the 
matrix $A = [a_1\ a_2\ \dots\ a_m]$, where the $a_i \in \mathbb{R}^n$ are the data to be reduced. This projection minimizes total
Euclidean  norm  error  over  all  data  and  is  computationally  tractable  even  when  the  input  matrix 
is  large.  Nevertheless,  computing  even  a  partial  singular  value  decomposition  (SVD)  is  expensive, 
which  has  prompted  research  into  relaxed  methods  that  approximate  the  truncated  singular  value 
decomposition. 

In particular, Krylov subspace methods can iteratively produce approximations to singular vectors,
but at substantially reduced cost [15, 75]. We begin with the definition of a Krylov subspace.

Definition 1. Given a matrix $A \in \mathbb{R}^{n \times n}$ and a vector $x_0 \in \mathbb{R}^n$, the Krylov subspace of $A$ and $x_0$ of
dimension $i$ is given by

(1.1) $\mathcal{K}_i(A, x_0) = \operatorname{span}\{x_0, A x_0, A^2 x_0, \dots, A^{i-1} x_0\}$

Krylov subspaces may also be defined when the vector $x_0$ is replaced by a matrix $X_0$ with $b$
linearly independent columns.

Definition 2. Given a matrix $A \in \mathbb{R}^{n \times n}$ and a matrix (or block vector) $X_0 \in \mathbb{R}^{n \times b}$ with $b$ linearly
independent columns, the block Krylov subspace of $A$ and $X_0$ of dimension $ib$ is given by

(1.2) $\mathcal{K}_i(A, X_0) = \operatorname{span}\{X_0, A X_0, A^2 X_0, \dots, A^{i-1} X_0\}$

To distinguish between the two Krylov subspaces, we will refer to the Krylov subspaces in
Definition 1 as single-vector Krylov subspaces, and the Krylov subspaces in Definition 2 as block
Krylov subspaces. When we refer to Krylov subspaces without qualification, the discussion concerns
general properties that are shared between the two subspace types.
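
The two definitions can be illustrated with a short NumPy sketch. It is not an algorithm from this thesis: the function name `block_krylov_basis`, the test matrix and the sizes are illustrative assumptions, and the basis is formed from explicit powers of $A$ followed by a QR factorization, which is only reasonable for a handful of iterations. Practical methods such as Lanczos orthogonalize as they go.

```python
import numpy as np

def block_krylov_basis(A, X0, i):
    """Orthonormal basis for span{X0, A X0, ..., A^(i-1) X0}.

    X0 may be a single vector (Definition 1) or an n-by-b block (Definition 2).
    """
    X0 = X0.reshape(A.shape[0], -1)      # a single vector becomes an n-by-1 block
    blocks = [X0]
    for _ in range(i - 1):
        blocks.append(A @ blocks[-1])    # next block is A times the previous block
    K = np.hstack(blocks)                # n-by-(i*b) Krylov matrix
    Q, _ = np.linalg.qr(K)               # orthonormalize its columns
    return Q

rng = np.random.default_rng(0)
n, b, i = 50, 3, 4
M = rng.standard_normal((n, n))
A = M @ M.T                              # symmetric test matrix (illustrative)

Q_single = block_krylov_basis(A, rng.standard_normal(n), i)
Q_block = block_krylov_basis(A, rng.standard_normal((n, b)), i)
print(Q_single.shape, Q_block.shape)     # dimensions i and i*b respectively
```

For the same number of iterations $i$, the block subspace has dimension $ib$ rather than $i$, at the cost of $b$ matrix-vector products per iteration.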

The  substantial  reduction  in  compute  time  comes  at  the  price  of  approximation  error;  as  utilized 
in  the  literature,  Krylov  subspace  projections  may  have  a  non-trivial  difference  in  approximation 
error  when  compared  to  the  truncated  SVD  [15,  75].  These  applications  all  limit  the  number  of 
iterations  performed  to  produce  the  least-expensive  Krylov  subspace  possible,  which  is  also  likely 
the  least  converged  [10, 14, 15, 75].  We  call  such  Krylov  subspaces  minimal. 

Definition 3. A single-vector Krylov subspace $\mathcal{K}_i(A, x_0)$ of dimension $i$ is said to be minimal for a
reduction to $k$ dimensions when $k = i$. A block Krylov subspace $\mathcal{K}_i(A, X_0)$ with $X_0 \in \mathbb{R}^{n \times b}$ is said to be
minimal for a reduction to $k$ dimensions when $(i-1)b < k \le ib$.
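
As a small worked example, and reading the block condition as $(i-1)b < k \le ib$ (the treatment of the boundary case is an assumption here), the minimal number of block iterations is simply $\lceil k/b \rceil$. The helper name `minimal_block_count` is hypothetical.

```python
import math

def minimal_block_count(k, b):
    """Smallest i with (i - 1) * b < k <= i * b, which is ceil(k / b)."""
    return math.ceil(k / b)

for k, b in [(10, 1), (10, 3), (12, 3)]:
    i = minimal_block_count(k, b)
    assert (i - 1) * b < k <= i * b      # the minimality condition holds
    print(f"k={k}, b={b}: minimal i={i}, subspace dimension {i * b}")
```

With $b = 1$ this reduces to the single-vector condition $k = i$.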

We therefore investigate methods to accelerate the convergence of Krylov subspaces to the
truncated SVD in the context of low-rank matrix approximation; in particular, we study block Krylov
subspaces with and without deflation, and combination power iteration-Krylov subspace methods to
produce low-rank matrix approximations that have smaller error, but are still less expensive than
the truncated SVD.

Improvements to Krylov subspaces may be realized by several methods. Block Krylov subspaces [16],
deflation [12] and shift-and-invert preconditioning [24] are all used to accelerate convergence
for the SVD or spectral decomposition. These methods have not been applied to the generic
dimension reduction problem with a restriction on the number of iterations. The limitation on the
number of iterations introduces a set of trade-offs which are distinct from the trade-offs for these
acceleration methods when applied to SVD or spectral decomposition when iterations are unlimited.
For example, implicit restarts are powerful methods for allowing an unlimited number of Lanczos
iterations without storage of a large basis set, but offer no advantage when the Krylov subspace
must be minimal. Moreover, the machinery needed to execute an implicit restart, which is intended
to allow for an unlimited number of stable iterations, adds overhead that may not be necessary if
only one restart is ever performed.

1.1  The  low-rank  approximation  problem 

Matrix  approximation  methods  are  powerful  techniques  that  can  cope  with  high-dimensional  data 
and  extract  a  low-rank  approximation  that  preserves  features  well  [23,39,76].  Matrix  approximation 
methods  have  a  compelling  set  of  features  that  render  them  popular  choices  for  dimension  reduction; 
they  minimize  the  sum-of-squares  approximation  error  over  all  points  of  data,  they  are  guaranteed 
to  converge  to  a  solution,  and  they  are  computationally  tractable.  These  methods  grow  out  of  the 
following basic matrix approximation problem: given a matrix $A \in \mathbb{R}^{n \times m}$ with rank $r \ll \min(n, m)$,
find a matrix $A^{(k)}$ with rank $k$ such that the norm of the error matrix

(1.3) $\|E\| = \|A - A^{(k)}\|$

is minimized for some norm $\|\cdot\|$. We define the relevant norms for this effort.

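Definition 4. Given an $m \times n$ matrix $A$ with singular values $\sigma_i(A)$ for $i = 1, 2, \dots, r$, the Frobenius
norm of $A$ is defined as

(1.4) $\|A\|_F = \sqrt{\sum_{i,j} |A_{ij}|^2} = \sqrt{\sum_{i=1}^{r} \sigma_i(A)^2}$

where $A_{ij}$ is the entry of $A$ in column $i$ and row $j$.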

Definition 5. Given a matrix $A$ with singular values $\sigma_i(A)$ for $i = 1, 2, \dots, n$, the nuclear norm of $A$
is defined as

(1.5) $\|A\|_* = \sum_{i=1}^{n} \sigma_i(A)$

These  norms  are  defined  in  terms  of  singular  values  of  A;  this  implies  that  the  singular  value 
decomposition  (SVD)  will  be  consequential  in  analysis  of  these  norms. 
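
A brief numerical check of Definitions 4 and 5 follows. It is a sketch on an arbitrary random matrix (the sizes and seed are illustrative) and compares the singular-value expressions with NumPy's built-in `"fro"` and `"nuc"` norms.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
sigma = np.linalg.svd(A, compute_uv=False)    # singular values only

fro_from_entries = np.sqrt(np.sum(A ** 2))    # entrywise form of the Frobenius norm
fro_from_sigma = np.sqrt(np.sum(sigma ** 2))  # singular-value form
nuc_from_sigma = np.sum(sigma)                # nuclear norm: sum of singular values

print(fro_from_entries, fro_from_sigma, np.linalg.norm(A, "fro"))
print(nuc_from_sigma, np.linalg.norm(A, "nuc"))
```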

Theorem 1. Let $A$ be an arbitrary $m \times n$ matrix and assume, without loss of generality, that $m \ge n$.
Then the singular value decomposition of $A$ gives

(1.6) $A = U \Sigma V^T$,

where $U$ is $m \times n$ and satisfies $U^T U = I$, $V$ is $n \times n$ and satisfies $V^T V = I$, and
$\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_n)$, where the singular values satisfy $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0$.
The columns $u_1, \dots, u_n$ of $U$ are called left singular vectors of $A$, and the columns
$v_1, \dots, v_n$ of $V$ are called right singular vectors of $A$. When $m < n$, the singular value
decomposition may be defined in terms of $A^T$.

The  spectral  decomposition  is  also  a  diagonal  matrix  factorization  we  will  consider  herein. 

Theorem 2. Let $A$ be a square $n \times n$ Hermitian matrix. Then the spectral decomposition
(alternately, the eigenvalue decomposition) of $A$ gives

(1.7) $A = U \Lambda U^T$,

where $U$ satisfies $U^T U = I$, $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$, and the eigenvalues of $A$ satisfy
$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$. The columns $u_1, \dots, u_n$ of $U$ are called eigenvectors of $A$.

The $A^{(k)}$ that minimizes the Frobenius norm error $\|\cdot\|_F$ is unique and prescribed by the singular
value decomposition (SVD) of $A$ or the spectral decomposition of an associated Gram matrix $AA^T$
or $A^T A$. Likewise, the $A^{(k)}$ that is optimal for minimization of the nuclear norm error is given by
the truncated SVD of $A$ or spectral decomposition of $A^T A$. SVDs and spectral decompositions have
polynomial complexity, may be computed iteratively, and are available in several high-performance
libraries.

Nevertheless, for sufficiently large data or data with particular spectral characteristics, the
singular value or spectral decomposition is costly [15]; many iterations may be required to obtain
total convergence of interior eigenvalues as predicted by eigenvalue bounds [17,67]. Slow convergence
is particularly problematic for eigenvalues that are not well-separated from their neighbors, but
such eigenvalue clustering may be encountered in low-rank approximation problems, especially those
requiring a reduction to hundreds of dimensions. Compute time and space savings may be realized
by relaxing the requirement for the global minimizer of (1.3) to an $A^{(k)}$ that is close to the global
minimizer. This sub-optimal $A^{(k)}$ may be computed in less time and with less memory than the true
minimizer of (1.3). Krylov subspaces may be used as an alternative to a singular vector or eigenvector
space for a dimension reduction problem [10,15,63]. Krylov subspaces are widely used for iteratively
computing the singular value or spectral decomposition due to their strong convergence properties
and modest computational costs. A Krylov subspace will contain good approximations of the singular
triplets or eigenpairs needed for minimizing (1.3) even after only a few iterations. One may simply
use a Krylov subspace that approximates singular triplets or eigenpairs well to find an $A^{(k)}$ that
makes (1.3) small. The approaches in the current literature all use minimal Krylov subspaces,
which require the fewest iterations and are the least expensive to produce, but also have
the poorest eigenvalue approximations. Between minimal Krylov subspaces and converged singular
vector or eigenvector spaces there is a spectrum of computational possibilities, with differing costs
and singular value approximation properties, as illustrated in Figure 1.1. The goal of this research is
to apply acceleration to Krylov subspaces that have a minimal or near-minimal number of iterations.
These accelerated Krylov subspaces are intended for approximation of the SVD or spectral
decomposition as applied to a dimension reduction problem. These accelerated Krylov subspace
projections will have error smaller than a minimal Krylov subspace with a modest increase in
computational cost. The increase in computational cost will be smaller than the cost of the fully
converged SVD or spectral decomposition.

Figure 1.1: Computational costs and singular value approximation errors of subspace projections from
minimal Krylov subspaces to singular vector spaces.
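
The spectrum of possibilities between a minimal Krylov subspace and the converged SVD can be sketched numerically. The example below is illustrative only and is not one of the algorithms developed in this thesis: the helper names, the synthetic matrix with singular values $1/j$, the start vector, and the choice of five extra iterations are all assumptions. The basis is built with Gram-Schmidt applied twice at each step, which is affordable only because the subspace is tiny.

```python
import numpy as np

def krylov_basis(G, x0, i):
    """Orthonormal basis of the i-dimensional Krylov subspace of G and x0."""
    Q = np.zeros((G.shape[0], i))
    Q[:, 0] = x0 / np.linalg.norm(x0)
    for j in range(1, i):
        w = G @ Q[:, j - 1]
        for _ in range(2):                        # Gram-Schmidt, applied twice for stability
            w -= Q[:, :j] @ (Q[:, :j].T @ w)
        Q[:, j] = w / np.linalg.norm(w)
    return Q

def rank_k_in_subspace(A, Q, k):
    """Best rank-k approximation of A whose columns lie in span(Q)."""
    U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ (U[:, :k] * s[:k]) @ Vt[:k]

rng = np.random.default_rng(2)
n, m, k = 200, 60, 5
# Synthetic data with known, decaying singular values (an illustrative choice).
U0, _ = np.linalg.qr(rng.standard_normal((n, m)))
V0, _ = np.linalg.qr(rng.standard_normal((m, m)))
s0 = 1.0 / np.arange(1, m + 1)
A = (U0 * s0) @ V0.T

G = A @ A.T                                       # Gram matrix of A
x0 = A @ rng.standard_normal(m)                   # start vector in the range of A

err_opt = np.sqrt(np.sum(s0[k:] ** 2))            # error of the truncated SVD
err_min = np.linalg.norm(A - rank_k_in_subspace(A, krylov_basis(G, x0, k), k))
err_ext = np.linalg.norm(A - rank_k_in_subspace(A, krylov_basis(G, x0, k + 5), k))
print(err_opt, err_min, err_ext)
```

The minimal subspace ($i = k$) gives the largest of the three Frobenius errors, a few extra iterations move the error toward the optimum, and the truncated SVD is the lower bound.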

The low-rank matrix approximation attempts to minimize (1.3). As suggested previously, the
SVD gives the optimal $A^{(k)}$ for any $A$ for the Frobenius norm. This minimization is a natural
consequence of the definition of the Frobenius norm in terms of singular values of $A$. To show how the
SVD can be used to generate the optimal low-rank $A^{(k)}$, let $A$ be an $n \times m$ real matrix with $m < n$,
without any loss of generality. Write $A = U \Sigma V^T$ as the SVD of $A$ with
$\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_m, 0, \dots)$. Let $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_m$ be the ordering of the singular
values of $A$. We have the definition of the Frobenius norm from Definition 4. Because of the ordering
of singular values, no partial sum of squares of $k$ singular values is greater than
$\sum_{i=1}^{k} \sigma_i^2$, and no partial sum of squares of $m - k$ singular values is less than
$\sum_{i=k+1}^{m} \sigma_i^2$. Therefore, the optimal $A^{(k)}$ is given by

(1.8) $A^{(k)} = U_k \Sigma_k V_k^T$

where $U_k$ and $V_k^T$ are the first $k$ left and right singular vectors as defined in Theorem 1.
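
The optimality argument can be checked numerically. The following sketch (random test matrix, illustrative sizes, hypothetical helper name `truncated_svd`) forms the truncated SVD, verifies that the squared Frobenius error equals the sum of the squared trailing singular values, and compares it with a rank-$k$ projection onto an arbitrary random subspace.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the leading k singular triplets of A and the full list of singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k], s

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 12))
k = 4

Uk, sk, Vkt, s = truncated_svd(A, k)
A_k = (Uk * sk) @ Vkt                             # U_k Sigma_k V_k^T
err_sq = np.linalg.norm(A - A_k, "fro") ** 2
tail_sq = np.sum(s[k:] ** 2)                      # sum of squared trailing singular values

# A competing rank-k approximation: project onto a random k-dimensional subspace.
Q, _ = np.linalg.qr(rng.standard_normal((40, k)))
err_other_sq = np.linalg.norm(A - Q @ (Q.T @ A), "fro") ** 2

print(err_sq, tail_sq, err_other_sq)
```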

The  generality  of  the  matrix  approximation  problem  has  resulted  in  its  widespread  application  to 
dimension  reduction  in  various  domains.  The  generality  is  a  result  of  the  way  in  which  the  problem 
is  posed:  no  domain-specific  information  or  objective  function  is  used.  Furthermore,  the  low-rank 
matrix  approximation  problem  minimizes  squared  Euclidean  distance  over  the  data,  which  implies 
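preservation of geometric features. More formally, if $S \subset \mathbb{R}^n$ is a set of data that is high-dimensional,
then the dimension reduction problem seeks to find an approximation set $\hat{S}$ such that for any point
$a_i \in S$ there is a corresponding approximation $\hat{a}_i \in \hat{S}$ with $\|a_i - \hat{a}_i\|_2^2$ small. The total squared
Euclidean norm error for one column is given by

(1.9) $e_i = \|a_i - \hat{a}_i\|_2^2 = \sum_{j=1}^{n} (a_{ij} - \hat{a}_{ij})^2$,

where $a_{ij}$ and $\hat{a}_{ij}$ denote the $j$-th entries of $a_i$ and $\hat{a}_i$. Note that if we assemble a matrix
$A = [a_1\ a_2\ \dots\ a_m]$ and a corresponding $A^{(k)} = [\hat{a}_1\ \hat{a}_2\ \dots\ \hat{a}_m]$, then
the squared error quantity in (1.3) is

(1.10) $\|E\|_F^2 = \|A - A^{(k)}\|_F^2 = \sum_{i=1}^{m} e_i$,

where $e_i$ is as in (1.9).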
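
The relation between per-column errors and the matrix error can be confirmed in a few lines. This is a sketch on random data with illustrative sizes; the columns of `A` play the role of the data points.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 5))                   # five data points in R^8, stored as columns

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :2] * s[:2]) @ Vt[:2]                 # rank-2 approximation of the data

e = np.sum((A - A_k) ** 2, axis=0)                # squared Euclidean error of each column
total = np.linalg.norm(A - A_k, "fro") ** 2       # squared Frobenius norm of the error matrix
print(e, e.sum(), total)
```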

The Frobenius norm of $A$ is also closely related to the nuclear norm of the positive semi-definite
Gram matrices $A^T A$ and $AA^T$. Eigenvalues of $AA^T$ and $A^T A$ are squares of singular values of $A$.
We have $A = U \Sigma V^T$. Let $(u_i, v_i, \sigma_i)$ be a singular triplet of $A$. By construction,

(1.11) $AA^T u_i = \sigma_i A v_i = \sigma_i^2 u_i$

by the definition of singular vectors of $A$. The case for $A^T A$ is similar. Moreover, left and right
singular vectors are eigenvectors of $AA^T$ and $A^T A$, respectively. Since $AA^T$ is positive
semi-definite, its SVD and spectral decomposition coincide, so the nonzero singular values of $AA^T$
are also singular values of $A^T A$. As the nuclear norm is defined as

(1.12) $\|A\|_* = \sum_{i=1}^{m} \sigma_i$

it is clear that

(1.13) $\|A\|_F^2 = \sum_{i=1}^{m} \sigma_i^2 = \sum_{i=1}^{m} \lambda_i = \|AA^T\|_* = \operatorname{tr}(AA^T)$,

where the $\lambda_i$ are eigenvalues of $AA^T$. Note that the trace and the nuclear norm coincide for
positive semi-definite matrices because the SVD and spectral decompositions coincide. Many existing
generic dimension reduction methods use the spectral decomposition of $AA^T$ to produce a dimension
reduction [76]; however, these are equivalent to using the SVD of $A$.
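
These identities are easy to verify numerically. The sketch below uses an arbitrary random matrix (sizes and seed are illustrative) and checks that the leading eigenvalues of the Gram matrix are the squared singular values, and that the squared Frobenius norm equals both the trace and the nuclear norm of the Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((7, 4))
sigma = np.linalg.svd(A, compute_uv=False)        # singular values, descending

G = A @ A.T                                       # 7-by-7 Gram matrix, rank 4
lam = np.linalg.eigvalsh(G)[::-1]                 # eigenvalues, reordered to descending

fro_sq = np.linalg.norm(A, "fro") ** 2
print(lam[:4], sigma ** 2)                        # leading eigenvalues vs. squared singular values
print(fro_sq, np.trace(G), np.linalg.norm(G, "nuc"))
```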

1.2 Approximation of the SVD for low-rank approximation

We have remarked that the posing of the low-rank approximation problem renders the problem
tractable, but nevertheless it can consume considerable compute time and memory storage. Though
this statement may appear contradictory, its core meaning is that the complexity of the problem does
not grow so quickly with problem size that it is unsolvable in practice. Even though the problem is
solvable, and there exist well-tuned algorithms for computing the truncated SVD or spectral
decomposition, these algorithms, which are iterative, may take many iterations to converge completely.
Complete convergence may not be necessary to obtain an approximation with low error.

Numerous methods are available for reduced-cost approximation of eigen- or singular
vector spaces for low-rank matrix approximation. Krylov subspaces are but one approach; they have
been studied previously for low-rank matrix approximation problems [10, 14, 15, 75]. More recently,
random projection methods have been studied and contrasted with Krylov subspace approaches for
low-rank matrix approximation [36]. These two methods share deep similarities with each other. The
success of random projection methods suggests Krylov subspace approaches that are alternatives to
the methods proposed previously; these alternative methods may produce better convergence or faster
compute times compared to existing Krylov subspace approaches. Much of the inspiration for this work
arises out of comparison of Krylov subspace and random projection methods for low-rank approximation.
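
To make the similarity concrete, the sketch below implements one common form of random projection: a Gaussian test matrix $\Omega$ followed by $q$ steps of power iteration, so that the basis spans the range of $(AA^T)^q A\Omega$. This is an assumed, generic variant and not necessarily the one studied in [36] or in later chapters; the function name, the synthetic matrix, the absence of oversampling and the values of $q$ are illustrative. The blocks $A\Omega, (AA^T)A\Omega, \dots, (AA^T)^q A\Omega$ are exactly the blocks that generate a block Krylov subspace of $AA^T$ started from $A\Omega$; the random projection keeps only the last one.

```python
import numpy as np

def random_projection_basis(A, ell, q, rng):
    """Orthonormal basis for the range of (A A^T)^q A Omega, with Omega Gaussian."""
    Omega = rng.standard_normal((A.shape[1], ell))
    Q, _ = np.linalg.qr(A @ Omega)
    for _ in range(q):
        W, _ = np.linalg.qr(A.T @ Q)              # re-orthonormalize between products
        Q, _ = np.linalg.qr(A @ W)
    return Q

rng = np.random.default_rng(6)
n, m, k = 300, 80, 5
# Synthetic matrix with slowly decaying singular values (illustrative).
U0, _ = np.linalg.qr(rng.standard_normal((n, m)))
V0, _ = np.linalg.qr(rng.standard_normal((m, m)))
s0 = 1.0 / np.sqrt(np.arange(1, m + 1))
A = (U0 * s0) @ V0.T

err_opt = np.sqrt(np.sum(s0[k:] ** 2))            # truncated-SVD error, the lower bound
errs = []
for q in (0, 1, 3):
    Q = random_projection_basis(A, k, q, np.random.default_rng(7))
    errs.append(np.linalg.norm(A - Q @ (Q.T @ A)))
print(err_opt, errs)
```

Each error is bounded below by the truncated-SVD error; how quickly power iteration closes the gap depends on the decay of the singular values.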

Though  both  Krylov  subspace  methods  and  random  projection  approaches  have  already  been 
proposed  and  evaluated  for  low-rank  approximation  problems,  many  open  questions  remain  which 
merit  further  attention.  Common  acceleration  methods  for  Krylov  subspaces  used  for  eigenvalue  or 
SVD  problems  have  not  been  employed  in  generic  dimension  reduction  Krylov  subspace  applications. 
Specifically, the following key open questions remain regarding Krylov subspaces as applied to
low-rank matrix approximation.

1.  Block  Krylov  subspaces  [64]  have  not  been  applied  to  the  generic  dimension  reduction  problem. 

2.  The  relaxation  of  the  requirement  for  a  converged  SVD  or  spectral  decomposition  allows  for 
new  algorithms  to  be  developed.  Such  algorithms  may  exploit  the  nonnecessity  of  eigenvalue 
convergence  to  address  common  deficiencies  in  Krylov  subspace  algorithms.  In  particular,  the 
difficulties  caused  by  round-off  error  may  be  ameliorated  or  eliminated  altogether. 

3.  Loss  of  orthogonality  for  the  single-vector  Lanczos  algorithm  has  been  characterized  in  [73], 
but  an  analysis  for  the  block  Lanczos  algorithm,  the  principal  block  Krylov  subspace  algorithm 
for  Hermitian  problems,  does  not  exist. 

4. Synthesis of random projections, particularly stationary power iteration, and Krylov subspaces
may lead to faster or better low-rank approximation when compute resources are constrained.
These constraints may be in the form of either time restrictions or memory limitations.
Though synthesis is appealing, no formal mechanisms exist to enable comparative
analysis of random projections versus hybrid Krylov-random projection methods.

To  address  these  open  questions,  we  develop  the  following  novel  contributions. 
A. We develop novel approaches using short block Krylov subspaces, including the
“shrink-and-iterate” method (section 3.2.2) and the hybrid block Krylov-random projection method
(Algorithm 5), which address items 1 and 4.

B.  We  present  Algorithm  4,  which  consists  of  modifications  to  the  Lanczos  algorithm  [4]  that 
substantially  reduce  loss  of  orthogonality  when  used  for  the  “shrink-and-iterate”  method  and 
eliminate  loss  of  orthogonality  for  the  hybrid  method.  These  modifications  address  item  2. 

C.  Theorems  7  and  8  permit  predictive  comparison  of  the  “shrink-and-iterate”  and  hybrid  short 
block  Krylov  subspace  methods  and  random  projections.  These  theorems  are  supported  by 
Lemmas  1  and  3.  Lemma  2  and  Theorem  4  allow  Theorems  7  and  8  to  be  used  for  comparison 
of  approximation  error  between  random  projections  and  our  new  short  block  Krylov  subspace 
algorithms. Such comparison is not possible using existing bounds. These theorems and lemmas
address item 3.

D.  In  Theorem  9,  we  adapt  the  bounds  on  loss  of  orthogonality  from  [73]  to  the  block  Lanczos 
algorithm.  This  allows  one  to  bound  loss  of  orthogonality  to  support  analysis  of  item  2. 

E. We develop GrABL, presented in Algorithm 10, which adaptively changes block size to reduce
compute costs. GrABL is distinct from other block methods in its deflation criterion and in the
inflation method described in Algorithm 9. GrABL is tailored for low-rank matrix applications of
Krylov subspaces. To support development and analysis of the inflation and deflation criteria
of GrABL, we develop Theorem 10. GrABL is a novel algorithm which is also a synthesis of
random projection and Krylov subspace methods to address item 4.

F. We analyze Algorithm 11, a variant of the single-vector Lanczos method [4] with inner
iteration [69], using stationary powers similar to random projection methods. Bounds to
characterize the effects of inner power iteration do not exist. To illustrate the effects of inner
power iteration on low-rank approximation, we present Theorem 11, which adapts eigenvalue
convergence bounds from [67]. This algorithm and analysis address item 4.

G. We develop Algorithm 13, a synthesis of random projections and Krylov subspaces for inverse
spectrum problems to address item 4.

The remainder of this thesis follows the preceding outline. In Chapter 2 we present background
material covering both Krylov subspace approaches and random projections for low-rank
approximation. We study the convergence of eigenvalues and contrast the asymptotic rates of convergence.
We then develop alternate uses of Krylov subspaces for the low-rank approximation problem; the use
of block Krylov subspaces with large block sizes is suggested by random projection methods, but
block Krylov subspaces for generic low-rank approximation have not been investigated or compared
to single-vector Krylov subspaces.

In Chapters 3 and 4 we present the “shrink-and-iterate” and hybrid block Krylov-random
projection methods. These methods are intended to be faster than random projection methods, or to
produce better eigenvalue and eigenvector approximations than random projection methods. We have
observed that random projections may produce lower approximation error than the single-vector
Krylov subspace methods in the literature; therefore, the “shrink-and-iterate” and hybrid methods
will also produce better approximations than the single-vector Krylov subspace approaches
studied previously. Additionally, these short block Krylov subspace methods have dramatically reduced
loss of orthogonality compared to traditional Krylov subspace methods. Loss of orthogonality is a
significant problem in all existing Krylov subspace methods; elimination of orthogonality loss also
eliminates the need for reorthogonalization methods [57, 73] that add expense and complexity
to Krylov subspace algorithms.

We then present in Chapters 5 and 6 block methods that use deflation to best leverage the
advantages presented by large block sizes while simultaneously managing the additional computational
costs imposed. Adaptive deflation of block size allows for asymptotically superior compute
complexity compared to random projections. Adaptively-blocked Krylov subspace methods exist [5,86], but
are designed for indefinite iteration. When applied to the low-rank matrix approximation problem
with a limited number of iterations, the current adaptive block methods produce larger
approximation errors than random projections. Development of new inflation subspaces and deflation criteria
produces a new algorithm, GrABL, that produces comparable performance to random projections,
while maintaining the compute cost advantages over random projections.

Random  projection  methods  also  suggest  the  use  of  power  iteration  as  an  acceleration  method; 
we  study  the  theoretical  and  practical  effects  of  power  iteration  acceleration  in  Chapter  7.  Power 
iteration  acceleration  is  far  less  expensive  than  shift-and-invert  preconditioning,  and  is  well-suited 
to  shifting  convergence  to  the  leading  extreme  of  the  spectrum  and  away  from  the  trailing  extreme. 
Inner  power  iteration  is  especially  suited  for  generating  a  Krylov  subspace  of  minimal  length  when 
matrix-vector  products  are  inexpensive  but  memory  storage  is  constrained.  No  expressions  exist 
to  characterize  the  change  in  eigenvalue  convergence  due  to  inner  power  iteration.  By  providing 
asymptotic,  worst-case  error  expressions,  we  allow  for  quantitative  comparison  of  the  convergence  of 
eigenvalues  in  Krylov  subspaces  formed  with  inner  power  iteration  to  the  convergence  of  eigenvalues 
in ordinary Krylov subspaces. Such comparisons allow for informed choices that may weigh the
convergence advantage of inner power iteration against the extra compute costs inner power iteration
imposes.

Shift-and-invert preconditioning may be applied to accelerate the convergence of minimal Krylov
subspaces whenever the input matrix admits an efficient sparse factorization. We develop new
strategies for applying and combining shifts specifically for low-rank approximation problems, and
compare them to existing shift-and-invert methodologies in Chapter 8. Faster convergence may be
obtained by better shift choices. Finally, we compare all methods in an application to model reduction.
Chapter 2


Low-rank  approximation  background 


Krylov subspaces have properties favorable for low-rank approximation problems that are solvable
with a truncated SVD or spectral decomposition. Eigenvalues or singular values usually converge
rapidly in Krylov subspaces, and the spectral components needed for a truncated singular vector
space or eigenspace projection are among those that bounds show will converge fastest in a Krylov
subspace. This fast convergence implies that a Krylov subspace will contain good approximations to
eigenvalues in only a few iterations. Applying a modest limit to the number of iterations should
not deteriorate the quality of leading eigenvalues and eigenvectors. One may simply perform $k$
iterations of a Krylov subspace algorithm to obtain the least expensive $k$-dimensional Krylov subspace
approximation possible, rather than compute all $k$ leading eigenvalues or singular values to
convergence. This is precisely what has been proposed to date for Krylov subspace dimension reduction and
low-rank approximation; the least-expensive Krylov subspace is used.

Apart from their good approximation properties, Krylov subspace methods lend themselves to
efficient computation, especially when the input matrix is Hermitian. In these cases, there are
Krylov subspace algorithms that produce an orthonormal basis for a Krylov subspace without the
need for a Gram-Schmidt process or Householder reflections [4]. Additionally, Krylov subspace
methods can produce a projection of the input matrix in the Krylov subspace as iteration proceeds,
rather than first generating a basis and then projecting the input matrix $A$ onto that basis. Krylov
subspace methods do have drawbacks, principally that they are prone to corruption by round-off
error, and that both leading and trailing eigenvalues converge fastest in a Krylov subspace [17, 67].
The former problem is well-known and has been addressed by methods to stabilize against the effects
of round-off error [57, 73]. The latter difficulty is less of a problem in traditional applications of Krylov
subspaces; indeed, this property is an asset, as it allows for computation of either end of the spectrum
with the same algorithm. Yet for low-rank approximation problems in which the number of iterations
available is limited, convergence of trailing spectral components can be problematic.
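
The properties described above can be seen in a textbook form of the Lanczos recurrence for a real symmetric matrix. The sketch below is a plain version without reorthogonalization and is not the modified algorithm developed later in this thesis; the function name `lanczos`, the random test matrix and the iteration count are illustrative assumptions. It builds the basis with a three-term recurrence, accumulates the tridiagonal projection $T = Q^T A Q$ as iteration proceeds, and reports the extreme Ritz values together with a measure of orthogonality loss.

```python
import numpy as np

def lanczos(A, x0, i):
    """Plain Lanczos (no reorthogonalization) for a real symmetric A.

    Returns Q (n-by-i basis) and the tridiagonal projection T (i-by-i).
    """
    n = A.shape[0]
    Q = np.zeros((n, i))
    alpha = np.zeros(i)
    beta = np.zeros(i - 1)
    q_prev = np.zeros(n)
    q = x0 / np.linalg.norm(x0)
    b = 0.0
    for j in range(i):
        Q[:, j] = q
        w = A @ q - b * q_prev                    # three-term recurrence
        alpha[j] = q @ w
        w = w - alpha[j] * q
        if j < i - 1:
            b = np.linalg.norm(w)
            beta[j] = b
            q_prev, q = q, w / b
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return Q, T

rng = np.random.default_rng(8)
M = rng.standard_normal((100, 100))
A = (M + M.T) / 2                                 # symmetric test matrix (illustrative)

Q, T = lanczos(A, rng.standard_normal(100), 8)
ritz = np.linalg.eigvalsh(T)                      # Ritz values: eigenvalue estimates
lam = np.linalg.eigvalsh(A)
orth_loss = np.linalg.norm(Q.T @ Q - np.eye(8))   # departure of the basis from orthonormality

print(ritz[[0, -1]], lam[[0, -1]], orth_loss)
```

After only eight iterations the extreme Ritz values approach both ends of the spectrum, which illustrates why convergence toward trailing eigenvalues competes with the leading ones when the iteration count is capped. Orthogonality loss is negligible here because no Ritz value has converged yet; it typically grows once Ritz values begin to converge.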

Random projection methods are closely related to Krylov subspace methods; both use
matrix-vector products to generate a low-rank subspace in which the important characteristics
(approximation of leading eigenpairs or singular triplets, in most cases) are well-preserved. There are
important distinctions between Krylov subspace methods and random projection methods, both in
terms of computational requirements and convergence of eigenvalues or singular values. Random
projection methods admit analysis as Krylov subspace methods, and these asymptotic analyses
provide additional insight into the spectral properties of input matrices that drive convergence of the
two respective methods.

2.1  Generic  dimension  reduction  background 

The generic dimension reduction problem is by no means new, and a great deal of the existing
measures reduce to the low-rank matrix approximation problem. An overwhelming number of these
generic techniques are direct descendants of, or reducible to, Principal Component Analysis (PCA) [39].
The Krylov methods under consideration here all approximate the PCA or a related dimension
reduction method. We briefly review the dimension reduction methods we aim to approximate
with Krylov subspace approximations.

PCA is perhaps the oldest generic dimension reduction method. Originally due to Pearson [59],
PCA in its original form is a geometric dimension reduction method, and attempts to solve a
minimization problem closely related to (1.3). Instead of the matrix $A = [a_1\ a_2\ \dots\ a_m]$, PCA uses the
centered matrix $A - \mu = [(a_1 - \mu)\ (a_2 - \mu)\ \dots\ (a_m - \mu)]$ with $\mu = \frac{1}{m} \sum_{i=1}^{m} a_i$. The spectral
decomposition of the Gram matrix $(A - \mu)(A - \mu)^T = U \Lambda U^T$ provides the projection

(2.1) $A^{(k)} = U_k^T (A - \mu) = \Sigma_k V_k^T$

with the SVD $(A - \mu) = U \Sigma V^T$. This is equivalent to the truncated SVD approximation derived from
$A - \mu$. Statistically, the centered Gram matrix $(A - \mu)(A - \mu)^T$ is identical to the sample covariance
matrix of the data points $a_j$, up to scaling.

Numerous dimension reduction methods result from applying PCA or a PCA-like method: the proper orthogonal decomposition (POD) [13,48,62] from signals analysis and model reduction, eigenfaces [79,80] from face recognition, and latent semantic indexing (LSI) [19,21] from information retrieval all use eigenvector or singular vector projections. The chief differences between these methods are due to the preprocessing and formation of the matrix A, such as centering used in PCA or term scaling used in LSI. As these methods all use the SVD or the spectral projection, they all share the same computational advantages and drawbacks: the optimal low-rank approximation is tractable and high-performance libraries exist, but large problems may require large compute times.

Spectral  graph  methods  also  use  the  spectral  decomposition  of  a  positive  semi-definite  matrix  to 
discover  properties  of  a  graph.  Various  graph  properties  may  be  determined  with  spectral  methods, 
from  vertex  clustering  [53,54,61,87]  to  visualization  [37,43,70]  to  partitioning  [60].  These  methods 
operate  on  the  graph  Laplacian  matrix;  either  the  combinatorial  Laplacian  or  normalized  Laplacian 
may be used. Both Laplacian matrices are defined in terms of the adjacency matrix of the graph, A, and the diagonal degree matrix of the graph, D. Both methods also require the eigenpairs $(\lambda_i, u_i)$ with the smallest $\lambda_i$ rather than the pairs with largest magnitudes as are used in PCA and PCA-like methods.

2.2  Krylov  subspace  background 

We  recall  the  definition  of  a  Krylov  subspace  from  Definition  1: 

(2.2)  $\mathcal{K}_i(A, x_0) = \mathrm{span}\{x_0, A x_0, A^2 x_0, \dots, A^{i-1} x_0\}$

Krylov subspaces are naturally amenable to iteration; one adds basis vectors for $\mathrm{span}\{A^j x_0\}$ one after the other. These basis vectors may be orthogonalized against one another to produce an orthonormal projection. We note that the linear independence of the vectors $A^j x_0$ is only guaranteed in infinitely-precise arithmetic. As realized on a computer, the vectors $A^j x_0$ lose linear independence rapidly, and are not practical for use in generating a basis. Instead, basis vectors are orthogonalized with respect to each other as they are generated to reduce the effects of round-off error [4,44]. Each basis vector improves the quality of the solution [31, 67, 81, 85]; iteration may be terminated once the solution is of acceptable quality. Krylov subspaces are also defined in block form, in which the start vector $x_0$ is replaced with a matrix $X_0 \in \mathbb{R}^{n \times r}$ with all columns linearly independent.
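
The loss of linear independence described above can be observed numerically. The following is a minimal sketch, assuming NumPy and a hypothetical diagonal test matrix; the helper names `krylov_matrix` and `orthonormal_krylov_basis` are illustrative and do not come from the thesis. It compares the conditioning of the raw (column-scaled) Krylov vectors with a basis that is orthogonalized as it is generated.

```python
import numpy as np

def krylov_matrix(A, x0, i):
    """Columns x0, A x0, ..., A^(i-1) x0, each rescaled to unit norm."""
    cols = []
    v = x0 / np.linalg.norm(x0)
    for _ in range(i):
        cols.append(v)
        v = A @ v
        v = v / np.linalg.norm(v)
    return np.column_stack(cols)

def orthonormal_krylov_basis(A, x0, i):
    """Orthogonalize each new direction against all earlier ones
    (classical Gram-Schmidt applied twice per vector)."""
    n = x0.shape[0]
    Q = np.zeros((n, i))
    Q[:, 0] = x0 / np.linalg.norm(x0)
    for j in range(1, i):
        w = A @ Q[:, j - 1]
        for _ in range(2):
            w = w - Q[:, :j] @ (Q[:, :j].T @ w)
        Q[:, j] = w / np.linalg.norm(w)
    return Q

rng = np.random.default_rng(0)
n, i = 200, 20
A = np.diag(np.linspace(1.0, 100.0, n))   # hypothetical symmetric test matrix
x0 = rng.standard_normal(n)

K = krylov_matrix(A, x0, i)
Q = orthonormal_krylov_basis(A, x0, i)

cond_K = np.linalg.cond(K)                        # very large: columns nearly dependent
orth_err = np.linalg.norm(Q.T @ Q - np.eye(i))    # near machine precision
print(f"cond(K) = {cond_K:.2e}, ||Q^T Q - I|| = {orth_err:.2e}")
```

The full Gram-Schmidt pass used here is only for illustration; the Lanczos-type algorithms discussed later avoid it.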

Krylov subspace methods are widespread choices for iteratively solving a truncated SVD or spectral decomposition. Extremal singular triplets or eigenpairs converge fastest in Krylov subspaces; these are a subset of the leading singular triplets or eigenpairs needed to solve the dimension reduction problem. The rapid convergence of extremal singular triplets or eigenpairs in Krylov subspaces suggests that one may simply terminate iteration before complete convergence of the truncated decomposition. This would produce an approximation that is close to the optimal solution of (1.3) but with reduced costs. The greatest opportunity for reduction of computational costs is given by minimal Krylov subspaces, which perform only as many iterations as are needed to induce a reduction to k dimensions.

Krylov subspaces for low-rank matrix approximation and related dimension reduction problems have been studied previously. Simon et al. [75] first proposed Krylov subspaces as an alternative to singular or spectral projections for low-rank matrix approximation. Blom and Ruhe [9, 10] applied Krylov subspaces to information retrieval tasks related to low-rank approximation, and showed how to leverage a priori information via the start vector $x_0$. Alternate relaxations to the SVD for information retrieval are proposed in [7, 8]. Chen and Saad [15] studied dimension reduction tasks including information retrieval and eigenfaces [80]. Chen, Fang and Saad [14] applied Krylov subspace approximation to the eigenvector approximation problem for graph partitioning. In addition to these applications, Eldén has shown [22] that Krylov subspace projections are equivalent to partial least squares [83], a dimension reduction method widely used in the chemometrics community. All of these applications realized a maximal compute time improvement with minimal Krylov subspaces. More recently, Halko et al. [36] have investigated random projection methods, which bear deep connections with minimal block Krylov subspaces.

The good results reported in the literature for Krylov subspaces for low-rank approximation are due to the rapid convergence of spectral components used for the optimal projection in Krylov subspaces. Fortunately, many low-rank approximation problems have just the spectral characteristics that drive good convergence in Krylov subspaces. Nevertheless, minimal Krylov subspaces of dimension k will often contain eigenvalue approximations (alternately called Ritz values and denoted $\theta_i$) where $\lambda_i - \theta_i$ is non-trivial. When this is the case, there may be a non-trivial gap in error between the SVD-derived rank-k approximation and the Krylov subspace rank-k approximation. Acceleration of convergence in the Krylov subspace may be realized by application of block methods with deflation, and shift-and-invert preconditioning [4]. To connect the acceleration methods proposed, we will review results on the convergence of Ritz values to eigenvalues in Krylov subspaces.

2.3  Convergence  of  eigenvalues  in  Krylov  subspaces 

When  spectral  properties  are  favorable,  eigenvalues  will  converge  rapidly  in  Krylov  subspaces.  There 
are  two  results  that  provide  lower  error  bounds  on  eigenvalue  estimates  in  Krylov  subspaces,  one  due 
to  Saad  [67]  and  another  due  to  Underwood  [31].  Examination  of  these  bounds  provides  insight  into 
what  spectral  properties  drive  convergence  of  Ritz  values  in  Krylov  subspaces,  and  transformations 
that  will  accelerate  convergence.  The  convergence  properties  of  eigenvalues  in  Krylov  subspaces  will 
also  illuminate  the  difficulties  posed  by  the  convergence  of  both  leading  and  trailing  eigenvalues  in 
a  Krylov  subspace. 

The  nature  of  eigenvalue  convergence  in  Krylov  subspaces  was  finally  resolved  in  the  1960s  and 
1970s,  beginning  with  Kaniel  [40],  proceeding  with  Paige  [55],  Underwood  [81]  and  culminating  with 
Saad [67]. These asymptotic bounds show that Krylov subspace projections do indeed produce good eigenvalue approximations with only a modest number of iterations. We present the bounds due to Underwood, which generalize the Paige results to block Krylov subspaces.

Theorem 3 (Underwood [81]). Consider a Krylov subspace $\mathcal{K}_i(A, X_0)$ with block size r of a Hermitian matrix A with eigenvalues ordered as $\lambda_1 > \lambda_2 > \dots > \lambda_{\inf}$. Let $\lambda_j$ be an eigenvalue of A with associated eigenvector $u_j$ with $\|u_j\| = 1$. Then for $j = 1, 2, \dots, r$ the eigenvalue approximation (alternately, Ritz value) $\lambda_j^{(i)}$ to $\lambda_j$ is bounded as

(2.3)  $0 \le \lambda_j - \lambda_j^{(i)} \le (\lambda_j - \lambda_{\inf}) \dfrac{\tan^2 \Theta}{T_i^2(\gamma_j)}$

where

(2.4)  $\gamma_j = 1 + 2 \dfrac{\lambda_j - \lambda_{r+1}}{\lambda_{r+1} - \lambda_{\inf}}$,

$T_i(\cdot)$ is the Chebyshev polynomial of the first kind with order i, and $\Theta = \cos^{-1} \tau$ where $\tau$ is the smallest singular value of the matrix $U^T X_0$ for $U = [u_1 \; u_2 \; \dots \; u_r]$.


Errors are driven to zero by the growth of the denominator in (2.3). The definition of $\gamma_j$ shows the influence of local gaps in the spectrum of the input matrix A. Well-separated eigenvalues will result in large $\gamma_j$, which causes the denominator of (2.3) to be large as well. The denominator may also be grown by increasing the Krylov subspace dimension, through the squared Chebyshev polynomial, which grows rapidly with the increased order resulting from successive iterations. Errors are also driven to zero by the block size increasing $\gamma_j$. Note that these bounds are for magnitude only; it may be the case for a Ritz pair $(\theta_i, v_i)$ and a corresponding eigenpair $(\lambda_i, u_i)$ that $\theta_i - \lambda_i$ is small but $\|v_i - u_i\|$ is not small. However, for the low-rank approximation problem, this is less of a problem, as the error in (1.3) is due only to eigenvalues.
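
To make the roles of the gap ratio and the Chebyshev polynomial concrete, the following sketch evaluates the gap ratio of (2.4) and the squared Chebyshev polynomial for a hypothetical, evenly spaced spectrum. It assumes NumPy; the function names `gap_ratio` and `cheb_T` are illustrative only, and the spectrum is invented for the example.

```python
import numpy as np

def gap_ratio(eigs, j, r):
    """gamma_j = 1 + 2 (lambda_j - lambda_{r+1}) / (lambda_{r+1} - lambda_inf).

    `j` and `r` are 1-based, matching the notation of the text.
    """
    lam = np.sort(np.asarray(eigs, dtype=float))[::-1]   # descending order
    return 1.0 + 2.0 * (lam[j - 1] - lam[r]) / (lam[r] - lam[-1])

def cheb_T(i, x):
    """Chebyshev polynomial of the first kind T_i(x), valid for x >= 1."""
    return np.cosh(i * np.arccosh(x))

# Hypothetical spectrum: 10.0, 9.9, ..., 0.0 (no large gaps anywhere).
eigs = np.linspace(10.0, 0.0, 101)

for r in (1, 5, 10):
    g = gap_ratio(eigs, 1, r)
    growth = [float(cheb_T(i, g) ** 2) for i in (2, 4, 8)]
    print(f"r = {r:2d}  gamma_1 = {g:.4f}  T_i^2(gamma_1), i = 2, 4, 8: {growth}")
```

In this example a larger block size r increases the gap ratio for the leading eigenvalue, and a higher polynomial order increases the squared Chebyshev term for any fixed ratio greater than one; both effects enlarge the denominator discussed above.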

These bounds also apply to the trailing component of the spectrum of A; a Krylov subspace will contain approximations to both leading and trailing eigenvalues. This is problematic for minimal Krylov subspace projections if trailing eigenpairs converge at least as quickly as the leading eigenpairs. In general, the quality of a Krylov subspace approximation may be improved by accelerating the convergence of leading eigenvalues. Though one may also generate a Krylov subspace of dimension 2k and discard the trailing k Ritz pairs, such an approach may consume a prohibitive amount of memory when the dimension of A is large. Therefore, we restrict ourselves to minimal or near-minimal Krylov subspace methods.

One may note that if the local gaps $\lambda_j - \lambda_{r+1}$ result in rapidly-growing $\gamma_j$ for small increases in r, then one may apply a block Krylov subspace method. Increasing block size will accelerate convergence of the eigenvalues. Alternately, applying a transformation function f to the spectrum of A will accelerate convergence of eigenvalues when it makes $\gamma_j$ large for leading eigenvalues but small for trailing ones. This is the principle behind shift-and-invert preconditioning: the spectrum of A is preconditioned with $f(\lambda_j) = (\lambda_j - s)^{-1}$ for some shift s. Eigenvalues close to s will have large magnitude and be well-separated from the remainder of the spectrum.
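
As a small illustration of the shift-and-invert transformation, the sketch below applies $f(\lambda) = (\lambda - s)^{-1}$ to a hypothetical spectrum with a shift placed just above the largest eigenvalue. The eigenvalues, the shift, and the helper `rel_gap` are all assumptions made for this example, not values from the thesis.

```python
import numpy as np

lam = np.array([10.0, 9.5, 9.0, 5.0, 1.0, 0.5])   # hypothetical spectrum
s = 10.2                                          # hypothetical shift near lambda_1
f = 1.0 / (lam - s)                               # shift-and-invert spectrum

def rel_gap(values):
    """Gap between the two largest magnitudes, relative to the spread of magnitudes."""
    v = np.sort(np.abs(values))[::-1]
    return (v[0] - v[1]) / (v[0] - v[-1])

print("original relative gap:   ", rel_gap(lam))
print("transformed relative gap:", rel_gap(f))
print("largest transformed magnitude belongs to lambda =", lam[np.argmax(np.abs(f))])
```

The eigenvalue closest to the shift dominates the transformed spectrum in magnitude, which is the separation that the bound in Theorem 3 rewards.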

2.4  Low-rank  approximation  error  and  eigenvalue  approximations 

For  low-rank  matrix  approximation  in  Krylov  subspaces,  we  measure  squared  Frobenius  norm  error 
when  the  matrix  is  not  positive  semi-definite;  for  positive  semi-definite  matrices,  the  matrix  trace 
is  a  useful  error  measurement  norm.  Ideally,  we  would  like  to  use  the  squared  Frobenius  norm  of 
the  low-rank  approximation  to  reason  about  the  low-rank  approximation  error.  Indeed,  there  is  a 
close  relationship  between  the  squared  Frobenius  norm  of  the  approximation  error  and  the  squared 
Frobenius norm of the low-rank approximation. We present the following original theorem to show the relationship between low-rank approximation error and the norm of the low-rank approximation $A^{(k)}$ formed by the projection of the input matrix A into a subspace.

Theorem 4. Let A be an arbitrary real rectangular matrix, Q be an orthonormal projection matrix, and P be an orthonormal matrix with $\mathrm{span}\{P\} = \mathrm{span}\{AQ\}$. Let $M = P^T A Q$ and $A^{(k)} = P M Q^T$. Then

(2.5)  $\|A - A^{(k)}\|_F^2 = \mathrm{tr}(AA^T) - \mathrm{tr}(MM^T)$.

Proof. Let A, Q, P, M and $A^{(k)}$ be as given. By definition of the Frobenius norm, $\|A\|_F^2 = \mathrm{tr}(AA^T)$. Then

(2.6)  $\|A - A^{(k)}\|_F^2 = \mathrm{tr}((A - A^{(k)})(A - A^{(k)})^T) = \mathrm{tr}(AA^T) - \mathrm{tr}(2AA^{(k)T}) + \mathrm{tr}(A^{(k)}A^{(k)T})$

since $\mathrm{tr}(AA^{(k)T}) = \mathrm{tr}(A^{(k)}A^T)$. Note that $\mathrm{tr}(A^{(k)}A^{(k)T}) = \mathrm{tr}(PMM^TP^T) = \mathrm{tr}(MM^T)$. Also, as $AQ = PM$,

(2.7)  $AA^{(k)T} = A(QM^TP^T) = PMM^TP^T$.

Then

(2.8)  $\mathrm{tr}(AA^{(k)T}) = \mathrm{tr}(MM^T)$

and (2.6) simplifies to

(2.9)  $\|A - A^{(k)}\|_F^2 = \mathrm{tr}(AA^T) - \mathrm{tr}(MM^T)$

which completes the proof of the theorem. □

This  result  allows  us  to  use  the  convergence  of  eigenvalues  to  infer  the  low-rank  approximation 
error. 
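
The identity in Theorem 4 is easy to check numerically. The following sketch, assuming NumPy and a hypothetical random test matrix, builds Q, P and M as in the theorem statement and compares both sides of (2.5); the dimensions and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 60, 40, 5
A = rng.standard_normal((m, n))                    # arbitrary real rectangular matrix

Q, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal columns, n x k
P, _ = np.linalg.qr(A @ Q)                         # orthonormal, span{P} = span{AQ}
M = P.T @ A @ Q
A_k = P @ M @ Q.T                                  # low-rank approximation of A

lhs = np.linalg.norm(A - A_k, "fro") ** 2
rhs = np.trace(A @ A.T) - np.trace(M @ M.T)
print(lhs, rhs)   # the two values agree up to round-off
```

Only the small matrix M is needed to track the error once $\mathrm{tr}(AA^T)$ is known, which is what makes the identity convenient in practice.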


2.5  Krylov  subspace  algorithms 

There  are  a  multitude  of  Krylov  subspace  algorithms  for  the  eigenproblem,  and  a  detailed  review  of 
them is beyond the scope of this section. Bai has compiled many of them in [4]. In this thesis, we are
concerned  mainly  with  the  original  Hermitian  Lanczos  algorithm  [44],  the  block  Hermitian  Lanczos 
algorithm  [16],  and  the  single-vector  [30]  and  block  version  [31]  of  the  Golub-Kahan  algorithm  for 
rectangular  matrices. 

The focus on the Lanczos algorithm is due to its efficiency. Note that one may generate a basis for $\mathbb{R}^n$ if A has full column rank. We noted that Krylov subspace projections have the property that extremal eigenvalues converge rapidly for Hermitian A with increasing iterations. If we are to generate an orthonormal basis for the Krylov subspace, a naive approach would simply apply the Gram-Schmidt process to the sequence of vectors $A^j x_0$. This method is prone to round-off error; moreover, it has complexity in $O(nk^2)$. The Lanczos algorithm partially addresses round-off error by orthogonalizing as it goes, and leverages the Hermitian-ness of the matrix to eliminate the need for the entire Gram-Schmidt process. Instead of requiring the new vector at step i be orthogonalized against all i - 1 previous vectors, the Lanczos process requires only the two most recent vectors (a three-term recurrence) to generate an orthonormal basis. This orthogonalization does preserve the linear independence of the Krylov subspace basis vectors for longer than would be the case for the raw Krylov vectors $A^j x_0$, but round-off error does eventually cause loss of orthogonality between the subspace basis vectors if no extra stabilization is used. More importantly, all iterations have equal cost, and extended iteration is bound only by storage of the basis vectors, rather than by the costs of the orthonormalization process. The basic Lanczos algorithm is presented in Algorithm 1. The Lanczos algorithm may be generalized to accept a block input $X_0$ with r columns. The block Lanczos algorithm is presented in Algorithm 2.

Algorithm 1 Classic Lanczos algorithm
Require: a priori chosen start vector $q_1$
1: $r \leftarrow A q_1$
2: for $j = 1, 2, \dots$ do
3:     $\alpha_j \leftarrow q_j^T r$
4:     $q_{j+1} \leftarrow r - q_j \alpha_j$
5:     $\beta_{j+1} \leftarrow \|q_{j+1}\|$
6:     $q_{j+1} \leftarrow q_{j+1} / \beta_{j+1}$
7:     $r \leftarrow A q_{j+1}$
8:     $r \leftarrow r - q_j \beta_{j+1}$
9: end for
10: return $[q_1 \; q_2 \; \dots]$, $\begin{bmatrix} \alpha_1 & \beta_2 & & \\ \beta_2 & \alpha_2 & \beta_3 & \\ & \beta_3 & \alpha_3 & \beta_4 \\ & & \ddots & \ddots \end{bmatrix}$
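
Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than a production implementation: it performs no reorthogonalization, the function name `lanczos` and the explicit `steps` argument are choices made for this example, and the test matrix is a hypothetical random symmetric matrix.

```python
import numpy as np

def lanczos(A, q1, steps):
    """Three-term Lanczos recurrence without reorthogonalization.

    Returns Q (n x steps, ideally orthonormal) and the symmetric
    tridiagonal matrix T, which equals Q^T A Q in exact arithmetic.
    """
    n = q1.shape[0]
    Q = np.zeros((n, steps))
    alpha = np.zeros(steps)
    beta = np.zeros(steps)               # beta[j] couples columns j-1 and j
    Q[:, 0] = q1 / np.linalg.norm(q1)
    r = A @ Q[:, 0]
    for j in range(steps):
        alpha[j] = Q[:, j] @ r
        r = r - alpha[j] * Q[:, j]
        if j == steps - 1:
            break
        beta[j + 1] = np.linalg.norm(r)
        if beta[j + 1] == 0.0:           # invariant subspace found; stop early
            Q, alpha, beta = Q[:, : j + 1], alpha[: j + 1], beta[: j + 1]
            break
        Q[:, j + 1] = r / beta[j + 1]
        r = A @ Q[:, j + 1] - beta[j + 1] * Q[:, j]
    T = np.diag(alpha) + np.diag(beta[1:], 1) + np.diag(beta[1:], -1)
    return Q, T

rng = np.random.default_rng(2)
n = 300
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                        # hypothetical symmetric test matrix
Q, T = lanczos(A, rng.standard_normal(n), 30)

ritz = np.linalg.eigvalsh(T)             # Ritz values, ascending
exact = np.linalg.eigvalsh(A)
print("largest Ritz value:", ritz[-1], " largest eigenvalue:", exact[-1])
```

Each pass through the loop touches only the current and previous basis vectors, so the per-iteration cost does not grow with the iteration count.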

Though  both  Lanczos  algorithms  address  the  round-off  error  problems  of  the  naive  approach 
somewhat,  they  still  cannot  guarantee  orthogonality  of  all  basis  vectors  in  finite-precision  arithmetic. 
Eigenvectors  that  have  converged  will  re-enter  the  Krylov  subspace  and  lead  to  spurious  eigenvalue 
approximations.  Some  degree  of  reorthogonalization  is  required  to  recover  from  round-off  error,  but 
applying  a  Gram-Schmidt  step  at  each  iteration  would  wreck  the  improvement  in  complexity  gained 
from  the  Hermitian-ness  of  A.  Selective  [57]  and  partial  reorthogonalization  [74]  both  address  the 
instability  of  the  Lanczos  algorithm  due  to  round-off  error,  and  have  computational  costs  that  are 
almost  certainly  less  than  simply  performing  a  Gram-Schmidt  step  at  each  iteration. 

The random projection methods in [36] also merit attention at this point. Though they are not true Krylov subspace methods, they may be considered a variant of the acceleration methods we study here. They are near minimal, and never store more than k basis vectors to produce a k-dimensional approximation. These methods use $\mathrm{span}\{A^p X_0\}$ with $X_0 \in \mathbb{R}^{n \times k}$ to produce their approximation. These methods are variants of power iteration [4, 32], but with a block instead of a vector. Convergence of the leading eigenvalues is accelerated by the block size, and the trailing Ritz vector estimates are excluded with a restart. We have observed that this method may produce lower-error approximations than single-vector Lanczos, and serves as a suitable starting point for investigation of accelerated block methods.

Algorithm 2 Classic block Lanczos algorithm
Require: a priori chosen start block $Q_1$
1: $R \leftarrow A Q_1$
2: for $j = 1, 2, \dots$ do
3:     $A_j \leftarrow Q_j^T R$
4:     $R \leftarrow R - Q_j A_j$
5:     QR factorize $R = Q_{j+1} B_{j+1}$
6:     $R \leftarrow A Q_{j+1}$
7:     $R \leftarrow R - Q_j B_{j+1}^T$
8: end for
9: return $[Q_1 \; Q_2 \; \dots]$, $\begin{bmatrix} A_1 & B_2^T & & \\ B_2 & A_2 & B_3^T & \\ & B_3 & A_3 & \ddots \\ & & \ddots & \ddots \end{bmatrix}$
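
The block Lanczos recurrence of Algorithm 2 can be sketched in the same style. As before this is an illustrative sketch assuming NumPy, with no reorthogonalization or deflation; the name `block_lanczos`, the `steps` argument, and the random test problem are hypothetical.

```python
import numpy as np

def block_lanczos(A, X0, steps):
    """Block Lanczos without reorthogonalization or deflation.

    Returns Q (n x r*steps) and the block tridiagonal T, which equals
    Q^T A Q in exact arithmetic.
    """
    n, r = X0.shape
    Q = np.zeros((n, r * steps))
    T = np.zeros((r * steps, r * steps))
    Qj, _ = np.linalg.qr(X0)             # orthonormal start block Q_1
    Q[:, :r] = Qj
    R = A @ Qj
    for j in range(steps):
        lo, hi = j * r, (j + 1) * r
        Aj = Qj.T @ R                    # diagonal block A_j
        T[lo:hi, lo:hi] = Aj
        if j == steps - 1:
            break
        R = R - Qj @ Aj
        Qnext, Bnext = np.linalg.qr(R)   # R = Q_{j+1} B_{j+1}
        T[hi:hi + r, lo:hi] = Bnext      # subdiagonal block
        T[lo:hi, hi:hi + r] = Bnext.T    # superdiagonal block
        Q[:, hi:hi + r] = Qnext
        R = A @ Qnext - Qj @ Bnext.T
        Qj = Qnext
    return Q, T

rng = np.random.default_rng(4)
n, r, steps = 200, 4, 5
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                        # hypothetical symmetric test matrix
Q, T = block_lanczos(A, rng.standard_normal((n, r)), steps)
print("||Q^T Q - I|| =", np.linalg.norm(Q.T @ Q - np.eye(r * steps)))
```

With a small number of block iterations, as in this example, the computed basis typically stays orthogonal to near machine precision; longer runs would need one of the reorthogonalization schemes cited above.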

2.6  Random  projection  methods 

Random projection methods [36] for approximating the truncated SVD are an alternative to Krylov subspace methods, though they have relevant similarities to Krylov subspace methods. As their name implies, random projection methods use an orthogonal projection to transform a large, likely sparse problem into a small, dense problem that approximates the original. Like Krylov subspace methods, random projection methods form an orthogonal projection basis using matrix products of the input matrix. But unlike Krylov subspaces, random projections do not use a series of matrix products; instead, they only use one matrix power $A^p$ for $p \in \mathbb{N}$. The random projection subspace is then

(2.10)  $S = \mathrm{span}\{A^p \Omega\}$

for a Hermitian input matrix A and a random block vector $\Omega$. The basic algorithm for the random projection method for approximating the leading k eigenvalues of a Hermitian input matrix is given in Algorithm 3.

Algorithm 3 Direct eigenvalue approximation for Hermitian matrices with random projections
Require: random matrix $\Omega$ for multiplication against A, optional power $p \in \mathbb{N}$.
1: QR factorize $QR = A^p \Omega$
2: $\hat{A} \leftarrow Q^T A Q$
3: eigendecompose $\hat{A} = U \Theta U^T$
4: return $\Theta$, $QU$

Rather than generating subspace dimensions through matrix powers, random projection methods use a block vector with linearly independent columns. The relationship with Krylov subspaces is through their mutual relationship with the power method. Additionally, random projection methods may be viewed as block Krylov subspace methods, with minimal iteration and maximal block sizes. The random projection space $\mathrm{span}\{A\Omega\}$ will approximate $\mathcal{K}_2(A, \Omega)$, as only the space $\mathrm{span}\{\Omega\} - \mathrm{span}\{A\Omega\}$ differs between them.
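
The steps of Algorithm 3 can be sketched as follows, assuming NumPy. The Gaussian choice for the random block, the function name `random_projection_eig`, its `seed` argument, and the test matrix with a geometrically decaying spectrum are all assumptions made for this illustration.

```python
import numpy as np

def random_projection_eig(A, k, p=1, seed=0):
    """Approximate the k leading eigenpairs of a symmetric matrix A
    from the subspace span{A^p Omega}, with Omega Gaussian (an assumption)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Y = rng.standard_normal((n, k))      # random block Omega
    for _ in range(p):
        Y = A @ Y                        # A^p Omega, without forming A^p
    Q, _ = np.linalg.qr(Y)               # line 1: orthonormal basis of the subspace
    A_hat = Q.T @ A @ Q                  # line 2: small projected matrix
    theta, U = np.linalg.eigh(A_hat)     # line 3: eigendecomposition (ascending)
    return theta[::-1], (Q @ U)[:, ::-1] # line 4: Ritz values and vectors, descending

rng = np.random.default_rng(3)
n, k = 100, 10
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = 2.0 ** -np.arange(n)               # hypothetical spectrum 1, 1/2, 1/4, ...
A = (V * lam) @ V.T
A = (A + A.T) / 2                        # enforce exact symmetry

theta, vecs = random_projection_eig(A, k, p=2)
print("leading Ritz values:", theta[:3])
print("leading eigenvalues:", lam[:3])
```

The basis is orthogonalized exactly once, before the projection, which is the structural reason the method avoids the gradual loss of orthogonality seen in Lanczos-type recurrences.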

Random projection methods also have strong properties for low-rank approximation of matrices; in [36] the error of a low-rank approximation $A^{(k)} = Q^T A Q$ where $\mathrm{span}\{Q\} = \mathrm{span}\{A^p \Omega\}$ is bounded as

(2.11)  $\|A - A^{(k)}\| \le \|\Sigma_2\| + \|\Sigma_2 \Omega_2 \Omega_1^{\dagger}\|$

where

(2.12)  $A = [U_1 \; U_2] \begin{bmatrix} \Sigma_1 & \\ & \Sigma_2 \end{bmatrix} [V_1 \; V_2]^T$

is the SVD of A with diagonal entries of $\Sigma$ in descending order, and $\Omega_1 = U_1^T \Omega$ and $\Omega_2 = U_2^T \Omega$. Note that these bounds are in terms of the aggregate error over all eigenvalues of A rather than individual eigenvalues. The motivation for bounds of this type, as opposed to the individual eigenvalue bounds in (2.3), is that they enable convenient specialization to the case when $\Omega$ is a random matrix and the singular values of $\Omega_1$ and $\Omega_2$ assume expected values. Additionally, bounds of this type clearly express how well a random projection space can be used for approximating a truncated SVD or spectral decomposition.

A further merit of random projection methods is that they do not suffer the problems with loss of orthogonality that afflict Krylov subspace methods. Even when the matrix is badly conditioned, the entire projection basis is orthogonalized at step 1 before it is used to project the input matrix A. Therefore, loss of orthogonality may only result from loss of linear independence of columns of $A^p X_0$. We have not observed the loss of orthogonality between basis vectors generated by the random projection method to be greater than machine precision $\epsilon$ in any case.

2.7  Discussion 

Krylov subspace methods and random projections share some important similarities; much of this thesis explores the similarities between them and the opportunities for hybridization. These hybrid approaches attempt to combine the best attributes of random projection methods and Krylov subspace methods for fast low-rank approximation. The random projection methods suggest that the use of Krylov subspaces with large blocks may attain better convergence and greater stability than iteration with a single vector. However, the computational complexity of random projections and block Lanczos iteration shows that costs are asymptotically quadratic with the block size, so smaller block sizes will be faster and scale better than large block sizes. We first study the use of large block sizes with a small number of Lanczos iterations and their effects on convergence relative to the random projection method. We then study how computational costs can be minimized by performing block iteration only when necessary to improve convergence, and reducing the block size adaptively when a wide block is no longer advantageous. The power parameter p on line 1 of Algorithm 3, used to improve convergence in the random projection method, motivates study of the use of power refinement to produce better start blocks to reduce $\tan^2 \Theta$ in (2.3) and to act as an acceleration method to improve convergence of the leading eigenvalues or singular values in the Krylov subspace. Alternate preconditioning methods may be drawn from existing techniques used for conditioning Krylov subspaces.

The overarching assumption is that either Krylov subspace projections or random projections will be used for approximation of a truncated SVD or spectral decomposition with an emphasis on computational savings over a converged truncated SVD. The number of power refinements available to the random projection method will be limited, as is the number of iterations available to Krylov subspace methods. This constraint is key to the hybrid approaches developed herein; if power refinements are not limited, or if Krylov iterations are not limited, many of the low-rank approximations considered in this effort will eventually converge to an exact solution. We are interested in solving an approximation of the truncated SVD or spectral decomposition as rapidly as possible, and assume constrained compute time and storage space.

Chapter 3


Short block Krylov subspaces for low-rank approximation


The low-rank matrix approximation problem presents itself in many domains. To find the rank-k approximation that minimizes the Frobenius norm of the error, one may simply apply the SVD to the input matrix A and truncate it to form $A^{(k)}$. This truncated SVD solution minimizes the Frobenius norm of $E = A - A^{(k)}$. As discussed previously, the SVD is tractable, but expensive. Relaxation of the low-rank problem to allow a solution $A^{(k)}$ that may not be optimal but still makes $\|E\|_F$ small may lead to non-negligible computational time savings. Minimal single-vector Krylov subspaces have been proposed for this problem [15, 75], but, as was discussed in Chapter 2, minimal single-vector Krylov subspace projections may have a non-trivial gap in norm between the PCA approximation and the minimal Krylov subspace approximation. Random projection methods [36] are alternate choices for low-rank matrix approximation, and may produce smaller approximation errors than minimal single-vector Krylov subspaces. Random projection methods also enjoy immunity to the loss of orthogonality that is problematic for Lanczos-type methods for generating Krylov subspaces. Nevertheless, random projection methods have computational drawbacks; compute costs excluding sparse matrix-vector costs scale quadratically with the dimension of the subspace produced. These non-sparse operations dominate compute costs when the input matrix is sufficiently structured and sparse. Substitution of a block Krylov subspace with a 50% smaller block will lead to compute time improvements; these compute time improvements allow one to either perform fewer Krylov subspace iterations and be faster than random projections, or perform more iterations in an attempt to produce smaller errors in equivalent compute times. It is possible to produce expressions that predict cases for which a short block Krylov subspace will construct a low-rank approximation with smaller error than random projections. A 50% reduction in block size will also limit loss of orthogonality when the number of Lanczos iterations is limited to between 2 and 4.

An advantage that random projection matrix approximation methods enjoy is robustness to the loss of orthogonality that afflicts the Lanczos-type Krylov subspace algorithms. Indeed, loss of orthogonality due to round-off error has been a constant source of difficulty in Lanczos-type algorithms [32], and prevented the adoption of the Lanczos method [44] until it was shown that loss of orthogonality and eigenvalue convergence were linked [55]. Even after that discovery, methods to correct loss of orthogonality [16, 57, 74] are still necessary to produce accurate eigenvalue calculations. Indeed, incorporation of an explicit reorthogonalization step into the Lanczos process “destroys the simplicity of the Lanczos procedure” [16]. All Lanczos procedures, block or single-vector, will lose orthogonality given a sufficient number of iterations. However, loss of orthogonality is tied to both eigenvalue convergence and the number of iterations; this was shown for the single-vector case in [73]. We extend this result to the block Lanczos algorithm in Theorem 9; this implies that a limited number of iterations will likewise limit loss of orthogonality.

In contrast to Lanczos methods to generate Krylov subspaces, the random projection methods introduced in Algorithm 3 are wholly robust to loss of orthogonality up to machine precision. This robustness is a result of the lack of true iteration in the random projection algorithms. Orthogonality is maintained completely throughout the entire random projection process, provided that round-off error does not lead to loss of linear independence of the columns of $A^p \Omega$. This robustness does come at a price; the computational complexity of the random projection method is the same as single-vector Lanczos with a full Gram-Schmidt reorthogonalization step at each iteration. In most cases, the dense linear algebra operations required by the random projection method are more expensive than the (likely sparse) matrix-vector products and dense linear algebra operations in the ordinary Lanczos algorithm. When sparse matrix-vector products used in the random projection method are more expensive than the QR factorizations or matrix-matrix projections, a different set of compute time advantages emerges for the block Lanczos algorithm over random projections. In these cases, the extra work of forming an orthonormal subspace and then re-projecting the input matrix into that subspace puts the random projection method at a disadvantage compared to classical Lanczos iteration.

The computational complexity of random projection methods is a result of their choice of solution subspace: $\mathrm{span}\{A^p X_0\}$ for $X_0 \in \mathbb{R}^{n \times k}$ and $p \in \mathbb{N}$. We contrast this with the block Krylov subspace $\mathrm{span}\{X_0, AX_0, A^2 X_0, \dots, A^{i-1} X_0\}$. To generate an orthonormal matrix for projection into $\mathrm{span}\{A^p X_0\}$, a Gram-Schmidt or equivalent process must be used to orthogonalize and normalize the columns of $A^p X_0$. These reorthogonalization methods all have complexity $O(nk^2)$. Block Krylov subspace algorithms, such as the block Lanczos algorithm [16], have equivalent complexity for block size k. Shrinking the block size with s as $b = k/s$ will render each block Lanczos iteration $s^2$ times as fast as the random projection method, which enables one to either generate a minimal block Krylov subspace and be faster than the random projection method or perform more than s iterations and attempt to produce better eigenvalue approximations than the random projection method. Moreover, if s is small enough, then such a block Lanczos method will not require any reorthogonalization.
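
The cost comparison above can be written out as a rough accounting. This is only a sketch under the $O(nk^2)$ model for the dense operations: the constant $c$ and the cost symbols $C_{\mathrm{rp}}$ and $C_{\mathrm{iter}}$ are hypothetical names introduced for this illustration, and sparse matrix-vector products are ignored.

```latex
\begin{align*}
C_{\mathrm{rp}} &\approx c\,n k^2
  && \text{(orthogonalizing the $n \times k$ matrix $A^p X_0$)} \\
C_{\mathrm{iter}}(b) &\approx c\,n b^2 = \frac{c\,n k^2}{s^2}, \qquad b = k/s
  && \text{(one block Lanczos iteration with block size $b$)} \\
s\,C_{\mathrm{iter}}(b) &\approx \frac{c\,n k^2}{s}
  && \text{($s$ iterations, reaching subspace dimension $k$)} \\
s^2\,C_{\mathrm{iter}}(b) &\approx c\,n k^2
  && \text{($s^2$ iterations, matching the random projection cost)}
\end{align*}
```

For $s = 2$, under these assumptions, a minimal block Krylov subspace of two iterations costs roughly half as much as the random projection, and about four iterations fit in the same dense-operation budget.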

To obtain superior compute costs via block size shrinkage, we propose new approaches: the “shrink-and-iterate” approach and the hybrid block Krylov-random projection approach. These methods apply a block size shrinkage factor of s, and perform some number of Krylov subspace iterations to obtain a subspace of dimension k. For practicality, s must be small; we consider only the case of s = 2, as larger s leads to more loss of orthogonality and more storage of Lanczos vectors. The “shrink-and-iterate” approach uses s = 2 and performs 3 or 4 block Lanczos iterations. The hybrid random projection and Krylov subspace method also sets s = 2 and uses $\mathcal{K}_2(A, A^2 X_0)$ to approximate $\mathcal{K}_4(A, X_0)$ for even more compute savings and more robustness to round-off error. In fact, we show that the block Lanczos algorithm can be modified so that use of $\mathcal{K}_2(A, A^2 X_0)$ is as stable as the random projection method, and produce a new modified block Lanczos method that incorporates a refinement step similar to refined Gram-Schmidt [4] that renders the hybrid method as stable as random projections in practice.

We wish to produce expressions that can predict whether the random projection method or the
shrink-and-iterate approach will produce a low-rank approximation of the input matrix with smaller error.

Figure 3.1: Computational costs and singular value approximation errors of truncated SVD versus random
projections and proposed block Krylov accelerations. The use of block Krylov subspaces with smaller block
sizes may yield smaller error with the same computational effort.


These expressions would also lend insight into which spectral properties determine whether random
projections or the shrink-and-iterate approach will produce better approximations. The existing
expressions lack the tightness required to compare the two methods in question; we produce a new
expression for the norm of the low-rank approximation from the random projection method and for the
norm of the low-rank approximation from a short block Krylov subspace. The latter expression can be
used to predict whether either the shrink-and-iterate approach using $\mathcal{K}_4(A, X_0)$ or the hybrid method
using $\mathcal{K}_2(A, A^2X_0)$ will produce a better low-rank approximation than the random projection method.
Figure 3.1 shows the existing random projection method and the proposed shrink-and-iterate approach
in the spectrum of generic dimension reduction methods. When and how eigenvalue approximations from a
block Krylov subspace will be better than those from a random projection method depends on the
spectrum of the input matrix, the number of dimensions in the subspace, and the amount of storage
available for iterations beyond $s$. We will show that the cases in which leading eigenvalues converge
quickly for the random projection method are also the cases in which the shrink-and-iterate approach
or the hybrid approach will produce a better low-rank approximation.

We begin by reviewing the random projection method for low-rank matrix approximation and
compare its projections to block Krylov subspaces generated by the block Lanczos algorithm. We
note the loss of orthogonality in the block Lanczos method, and propose a block Lanczos variation
with refinement akin to refinement in the Gram-Schmidt orthonormalization algorithm. We then
investigate convergence of eigenvalues in random projection and block Krylov subspaces and develop
new bounds for both algorithms to enable comparison of convergence with respect to low-rank
approximation. We conclude with stability analysis of the block Lanczos method.

3.1  Random  projections  for  low-rank  matrix  approximation 

Random projections may be considered quasi-Krylov subspace methods. Halko et al. proposed a
variety of them in [36], but we focus on the “direct eigenvalue approximation for Hermitian matrices
with random projections” algorithm. This algorithm is essentially the Rayleigh-Ritz orthogonal
projection method using normalized power iteration [4, 50] to generate the projection matrix. The
algorithm is introduced in Algorithm 3 herein. This method is presented as an eigenvalue approximation
method; to effect a low-rank matrix approximation, one may simply use the Ritz vector
matrix $V = QU$ and the Ritz value matrix $\Theta$ to produce the low-rank approximation as $A^{(k)} = V\Theta V^T$.
If $A$ is rectangular, then the Ritz vector matrix can either be used for

(3.1)  $A^{(k)} = V^TA$

if $V\Theta V^T \approx AA^T$ or used to approximate $A$ directly as

(3.2)  $A^{(k)} = V\Theta^{1/2}$

if $V\Theta V^T \approx A^TA$. The square root of $\Theta$ always exists, provided that no null eigenvector of $A$ is in
$\operatorname{span}\{A^p\Omega\}$; this square root is also easy to compute as $\Theta$ is diagonal.
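The construction above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 3: the function names `randomized_ritz` and `low_rank_from_ritz`, the Gaussian choice of $\Omega$, the default `p=2`, and the re-orthonormalization after every product with $A$ are all choices made here for the example. The input is assumed to be real and symmetric.

```python
import numpy as np

def randomized_ritz(A, k, p=2, seed=0):
    """Rayleigh-Ritz on span{A^p Omega} for a symmetric matrix A (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    # Assumption: Gaussian random start block Omega with k columns.
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    for _ in range(p):
        # Normalized power iteration: orthonormalize after each product with A.
        Q, _ = np.linalg.qr(A @ Q)
    T = Q.T @ A @ Q                # projection of A onto the subspace
    theta, U = np.linalg.eigh(T)   # Ritz values and eigenvectors of T
    V = Q @ U                      # Ritz vectors V = QU
    return V, theta

def low_rank_from_ritz(V, theta):
    """Assemble the approximation V diag(theta) V^T."""
    return (V * theta) @ V.T
```

For a symmetric positive semidefinite matrix of exact rank $k$, the subspace generically captures the whole range, so `low_rank_from_ritz` should reproduce the input up to round-off.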

This  algorithm  is  simple  and  elegant.  We  have  also  remarked  that  it  is  robust  to  the  round-off 
errors  that  cause  loss  of  orthogonality  in  Lanczos  algorithms.  One  may  observe  that  no  true  iteration 
occurs  in  this  algorithm;  all  orthogonality-enforcing  operations  work  on  Q  as  a  whole. 

The algorithm does not explicitly specify a method for generating the projection basis $Q$; many
candidate methods are studied in [36]. However, the QR factorization $QR = A^p\Omega$ is suggested and
forms the basis for the probabilistic analyses of the error of the algorithm presented in [36]. Use
of that same $Q$ also allows for more direct analytic comparison against traditional Krylov subspace
algorithms. The choice of $\operatorname{span}\{A^p\Omega\}$ is further suggested by its relationship to harmonic Ritz
vectors [56]; the space $\operatorname{span}\{A^p\Omega\}$ is the minimal basis for harmonic Ritz vectors, which are
minimum-residual approximations to eigenvectors rather than Galerkin approximations. Harmonic Ritz
vectors have been noted as good choices for a subspace for the Rayleigh-Ritz method [52] or
for restarts of the Lanczos process [41].

The algorithmic complexity of the method is dominated by the QR factorization on line 1 and the
formation of the projected matrix on line 2. The matrix $A^p\Omega$ will be $n \times k$, and each basis vector
$i \le k$ must be orthogonalized against all $i - 1$ previous vectors; therefore the resulting complexity is
$O(nk^2)$. The formation of the projected matrix on line 2 also requires pairwise inner products among
the $k$ basis vectors, for a complexity of $O(nk^2)$ as well. Overall, the method requires
$4k^3 + 4k^2n + 4k\,n_{nnz} - kn$ FLOPS to generate a $k$-dimensional approximation, where $A$ has $n_{nnz}$
non-zero elements.

3.2  Block  Lanczos  for  low-rank  approximation 

Block Krylov subspaces may be used to generate low-rank matrix approximations much in the way
that single-vector Krylov subspaces are used for low-rank matrix approximations in [15, 75]. An
advantage of a Krylov subspace method is that it produces both an orthonormal basis $Q$ for the
projection subspace $\mathcal{K}_i(A, x_0)$ and the projection $T_{k,k}$ of $A$ in that subspace. This eliminates the need
for an expensive $O(nk^2)$ operation to project $A$ onto the solution subspace. Moreover, the projected
matrix $T_{k,k}$ has band tridiagonal structure, which makes manipulation of it more efficient than if it
had arbitrary structure. Block Lanczos methods are widely used for solving eigenproblems [16, 31, 33,
88], and are also used in low-rank matrix approximation problems for model reduction [25, 27, 34, 35].
The block Lanczos method may exhibit faster convergence of eigenvalues, especially when they are
tightly clustered [67]. The block Lanczos method requires $4b^2kn + 2bk\,n_{nnz}$ FLOPS to produce a
$bk$-dimensional subspace, where $b$ is the block size. If $b = k/2$, then the block Lanczos routine requires
$3k^2n + 2k\,n_{nnz}$ FLOPS.

In the case that $A$ is square and Hermitian, we may simply approximate $A$ with

(3.3)  $A^{(k)} = QT_{k,k}Q^T$.

If $A$ is rectangular, then we generate $\mathcal{K}_i(A^TA, x_0)$ and may use the Cholesky factorization
$LL^T = T_{k,k}$, as $T_{k,k}$ is positive-definite provided that no null eigenvector of $A^TA$ has converged in
the Krylov subspace. Then we may use

(3.4)  A‘^>  =  Q^L^. 

These  formulas  hold  for  either  single-vector  or  block  Krylov  subspaces. 

We note that the literature examples of Krylov subspaces for low-rank matrix approximation use
minimal Krylov subspaces $\mathcal{K}_i(A, x_0)$ with $i = k$. That is, to produce a $k$-dimensional approximation,
the Krylov subspace $\mathcal{K}_k(A, x_0)$ is used. This is by no means the only way to apply a Krylov subspace
to a low-rank approximation problem. One may generate a Krylov subspace $\mathcal{K}_i(A, x_0)$ with $i > k$ and
truncate using the leading $k$ Ritz vectors. The low-rank approximation becomes

(3.5)  $A^{(k)} = V_k\Theta_kV_k^T$

where $T_{i,i} = U\Theta U^T$ with Ritz pairs $(u_j, \theta_j)$, $\Theta_k = \operatorname{diag}(\theta_1, \theta_2, \ldots, \theta_k)$ and $V_k = [Qu_1\ Qu_2\ \ldots\ Qu_k]$.
Intuitively, we expect that more iterations will lead to more converged Ritz values, which then will
have larger magnitudes, and that the resulting low-rank approximations will be better.
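A minimal sketch of the truncation step follows, assuming a real symmetric projected matrix and taking "leading" to mean algebraically largest Ritz values (an assumption that matches the positive semidefinite case). The function name `truncate_to_leading_ritz` is hypothetical.

```python
import numpy as np

def truncate_to_leading_ritz(Q, T, k):
    """Keep the k Ritz pairs of T with the largest Ritz values (illustrative sketch).

    Q: (n, m) orthonormal basis of the Krylov subspace, T: (m, m) symmetric projection.
    """
    theta, U = np.linalg.eigh(T)          # Ritz values in ascending order
    lead = np.argsort(theta)[::-1][:k]    # indices of the k largest Ritz values
    theta_k = theta[lead]
    V_k = Q @ U[:, lead]                  # V_k = [Q u_1 ... Q u_k]
    return V_k, theta_k
```

For symmetric $A$, the truncated approximation can then be assembled as `(V_k * theta_k) @ V_k.T`.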

The block Lanczos algorithm [4, 16] is the block version of the single-vector Lanczos algorithm
used in [15] for generating an orthonormal basis for the Krylov subspace $\mathcal{K}_i(A, x_0)$. The algorithm
is presented in Algorithm 2. The salient difference in use is that the single-vector Lanczos method
accepts a vector $x_0$ and produces an orthonormal basis for $\mathcal{K}_i(A, x_0)$, whereas the block Lanczos method
accepts a matrix $X_0$ to produce $\mathcal{K}_i(A, X_0)$. Both methods are based on the Lanczos recurrence

(3.6)  $Q_{j+1}B_{j+1} = AQ_j - Q_jA_j - Q_{j-1}B_j^T$

where $A_j$ and $B_j$ are from the Lanczos algorithm, with $A_j = Q_j^TAQ_j$. The Lanczos recurrence gives
a three-term relation for enforcing orthogonality between basis vectors of the Krylov subspace. Note
that the block Lanczos algorithm is simply a generalization of the single-vector Lanczos
algorithm, and reduces to it when $X_0$ has 1 column. Though the Lanczos recurrence holds in exact
arithmetic, in finite-precision floating point arithmetic the equality is not exact. Relying solely
on (3.6) will lead to a gradual loss of orthogonality in the columns of $Q$ as iterations proceed.

3.2.1  Block Lanczos with refinement for improved stability

Round-off error causes loss of orthogonality for several reasons, and lost orthogonality compounds
over iterations. Moreover, round-off error may cause adjacent Lanczos blocks to be non-orthogonal.
Therefore minimizing round-off error produced early in Lanczos iteration will help maintain stability.
The steps on lines 3 and 4 of Algorithm 2 are a source of initial error. These lines are equivalent to a
classical Gram-Schmidt orthogonalization of $R$ against $\operatorname{span}\{Q_j\}$. However, $R - Q_jA_j$ will not be
completely orthogonal to $Q_j$ in the presence of round-off error. Adding an extra refinement step similar
to the modified Gram-Schmidt process results in Algorithm 4, the block Lanczos method with a modified
orthogonalization step. The addition of the extra stabilization step at line 6 has apparently not been
suggested previously. The extra stability added by the refinement would only serve to delay the loss of
orthogonality between Lanczos basis vectors; round-off error would compound anyway and some
reorthogonalization routine would still be necessary. We suspect that this is the reason that a
refinement step augmentation for the Lanczos algorithm has not been proposed before.

The addition of refinement may only maintain acceptable orthogonality for a few extra iterations,
which is not significant if the number of Lanczos iterations is assumed to be large. However, if the
number of iterations is fixed beforehand and small, then the addition of a refinement step may be
sufficient to maintain orthogonality and eliminate the need for reorthogonalization. This modification
adds a non-trivial cost to the classical Lanczos algorithm, as the projections $Q_j^TR$ are among the
most expensive operations per iteration, assuming the cost of the sparse matrix-vector products $AQ_j$
is negligible. This method does reduce round-off error in the initial phase of the algorithm; even
if only two iterations are performed, the error due to lines 3 and 4 in Algorithm 2 may leave the
maximum cosine between basis vectors much larger than machine precision, but lines 5 and 6 in
Algorithm 4 reduce this error. The minimization of initial error not only improves the case for 2
iterations, but implies slower loss of orthogonality in the next few iterations. The extra refinement
adds $2b^2kn$ FLOPS to the ordinary block Lanczos method.

Algorithm 4  Block Lanczos with refinement
Require: a priori chosen start block $Q_1$
 1: $R \leftarrow AQ_1$
 2: for $j = 1, 2, \ldots, i$ do
 3:   $A_j \leftarrow Q_j^TR$
 4:   $R \leftarrow R - Q_jA_j$
 5:   $S \leftarrow Q_j^TR$
 6:   $R \leftarrow R - Q_jS$
 7:   QR factorize $R = Q_{j+1}B_{j+1}$
 8:   $R \leftarrow AQ_{j+1}$
 9:   $R \leftarrow R - Q_jB_{j+1}^T$
10: end for
11: return $[Q_1\ Q_2\ \ldots]$, $T = \begin{bmatrix} A_1 & B_2^T & & \\ B_2 & A_2 & B_3^T & \\ & B_3 & A_3 & B_4^T \\ & & \ddots & \ddots \end{bmatrix}$
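The steps of Algorithm 4 can be sketched in NumPy as below. This is a hedged illustration rather than the author's implementation: the function name `block_lanczos_refined` is hypothetical, $A$ is assumed real symmetric, and the loop is assumed to stop after the last diagonal block is formed, so that exactly `iters` blocks are returned (the listing does not make the stopping point explicit).

```python
import numpy as np

def block_lanczos_refined(A, Q1, iters):
    """Block Lanczos with one extra refinement projection per step (illustrative sketch).

    A: symmetric (n, n); Q1: (n, b) with orthonormal columns.
    Returns Q of shape (n, iters*b) and the block tridiagonal T approximating Q^T A Q.
    """
    b = Q1.shape[1]
    blocks = [Q1]
    T = np.zeros((iters * b, iters * b))
    R = A @ Q1
    for j in range(iters):
        Qj = blocks[-1]
        lo, hi = j * b, (j + 1) * b
        Aj = Qj.T @ R                    # diagonal block A_j
        R = R - Qj @ Aj                  # classical Gram-Schmidt step
        R = R - Qj @ (Qj.T @ R)          # refinement: project out Q_j a second time
        T[lo:hi, lo:hi] = Aj
        if j == iters - 1:
            break                        # assumption: do not form Q_{iters+1}
        Qnext, Bnext = np.linalg.qr(R)   # R = Q_{j+1} B_{j+1}
        T[hi:hi + b, lo:hi] = Bnext      # sub-diagonal block B_{j+1}
        T[lo:hi, hi:hi + b] = Bnext.T    # super-diagonal block B_{j+1}^T
        R = A @ Qnext - Qj @ Bnext.T     # begin the next step
        blocks.append(Qnext)
    return np.hstack(blocks), T
```

For a small, fixed number of iterations, one would expect `Q.T @ Q` to stay close to the identity and `Q.T @ A @ Q` to agree with `T` up to round-off, which is the regime the text targets.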

3.2.2  The  shrink-and-iterate  approach 

The complexity of the block Lanczos algorithm is determined by the block size $b$, the number of
columns of $Q_1 = X_0$. The matrix-matrix product on line 4 and the QR factorization on line 5 require
$O(nb^2)$ operations. This is the same complexity as the random projection method in Algorithm 3
when $b = k$, so the block Lanczos algorithm is asymptotically as expensive as the random projection
method. If one uses a smaller block size, then each iteration of the block Lanczos routine will be
accelerated. For example, if the block size is $b/2$, then each iteration of block Lanczos will be roughly
four times as fast as with a block size of $b$. One may expect to perform four Lanczos iterations in
roughly the same amount of time as is required to form a $k$-dimensional subspace using the random
projection method in Algorithm 3; the resulting Krylov subspace will be $2k$-dimensional. In general,
one may shrink the block size by a block shrinkage factor of $s$ and generate $\mathcal{K}_{s^2}(A, X_0)$.
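The arithmetic behind this trade-off can be made explicit with a small leading-order model. The function name `shrink_and_iterate_budget` is hypothetical, and the model assumes, as the discussion above does, that the dense per-iteration cost scales as $nb^2$ and that sparse products with $A$ are negligible.

```python
def shrink_and_iterate_budget(k, s):
    """Leading-order iteration budget for block size b = k/s (illustrative model).

    Assumes dense per-iteration cost proportional to n*b**2 and negligible
    sparse matrix-vector products.
    """
    if k % s != 0:
        raise ValueError("k must be divisible by the shrinkage factor s")
    b = k // s                    # shrunken block size
    speedup = (k // b) ** 2       # per-iteration speedup relative to block size k: s^2
    iterations = speedup          # iterations affordable at equal dense cost
    dimension = iterations * b    # resulting Krylov subspace dimension: s*k
    return b, speedup, iterations, dimension
```

With `k=100` and `s=2` the model gives a block size of 50, a fourfold per-iteration speedup, four affordable iterations, and a 200-dimensional subspace, which matches the $2k$ figure quoted above.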

The above complexity analysis suggests that shrinking block sizes will always lead to computational
gains, provided that the sparse matrix-vector products needed to form the $AQ_i$ are not
expensive. The irresistible conclusion that setting the block shrinkage factor $s = k$, resulting in
single-vector Lanczos and generating a $k^2$-dimensional subspace, is always the best choice may be
tempered by two important observations:

1. the increasingly long Krylov subspaces generated by the shrink-and-iterate scheme described
above will require storage of all $k^2$ Lanczos basis vectors, and

2. the Lanczos algorithm will require reorthogonalization at some point to recover orthogonality
lost due to round-off error.

Item 1 is a non-trivial limitation; memory constraints led to the restarted Lanczos routines [3, 41]
that limit the number of Lanczos vectors stored. We eschew restarted methods due to their computational
costs and algorithmic complexity; our overarching goal is to develop a block Lanczos
method that requires no stabilization and does not perform many iterations. It is also possible to
simply not generate all Lanczos vectors, but item 2 limits the number of iterations available without
any stabilization. In order to both keep storage requirements modest and require no stabilization,
any increase in the number of iterations must be limited to small $s$. Spectral properties
govern both the convergence of eigenvalues in Krylov subspaces and loss of orthogonality in the basis
vectors. Therefore, spectral properties of the input matrix $A$ determine the cases for which block
Lanczos with a block size shrinkage of $s$ will produce a better low-rank approximation than
Algorithm 3 and the cases for which block Lanczos will not require stabilization.

From the above complexity analysis, it is not difficult to see that the shrink-and-iterate method
previously described can allow the block Lanczos method to generate a Krylov subspace $\mathcal{K}_i(A, X_0)$ of
dimension $i > k$ in the same amount of time necessary for the random projection method to produce
a subspace of size $k$. Truncation by Ritz vectors may lead to better low-rank approximations
of $A$ when the leading $k$ Ritz values of the Krylov subspace are larger than the Ritz values produced
by the random projection method. However, the number of iterations beyond $k$ available to a block
Lanczos routine is constrained both by memory and by loss of orthogonality, and the number of
iterations and the block size also influence convergence of eigenvalues in the Krylov subspace. Through
analysis of the spectrum of $A$ we may develop sufficient conditions for a block Lanczos method with more
iterations to lose negligible orthogonality and still produce a better low-rank approximation than
the random projection method. However, we first investigate a hybrid random projection-Krylov
subspace method that aims to combine the good stability of the random projection method and the
improved computational complexity of the shrink-and-iterate approach.

3.2.3  The  hybrid  random  projection-Krylov  subspace  method 

Both random projections and block Lanczos algorithms have their own particular advantages and
drawbacks. Random projections can produce good approximations with no loss of orthogonality in
the basis they produce, but the computational complexity is quadratic in the number of dimensions
produced. Block Krylov subspaces can have better computational complexity from shrinking the
block size by a factor of $s$, which allows for a larger solution subspace to be produced with the same
complexity as random projections. Then one can either perform $s^2$ iterations and potentially get
a better low-rank approximation, or perform fewer than $s^2$ iterations and simply be faster than
random projections. Ideally, we would like to produce a low-rank approximation with smaller error
than the random projection method, avoid storage of an $sk$-dimensional subspace, have no loss of
orthogonality, and be faster. The hybrid random projection-Krylov subspace method attempts to
combine the two algorithms to get the best qualities of both. The hybrid approach is presented in
Algorithm 5.

The hybrid approach is rather straightforward; the random projection method is used to produce
a start block for the block Lanczos method with refinement. The motivation for the hybrid approach
is to approximate $\mathcal{K}_{s^2}(A, X_0)$ with $\mathcal{K}_{s^2-p}(A, A^pX_0)$. Increasing the value of $p$ reduces the number
of basis vectors stored and also reduces the number of Lanczos iterations. Clearly, it is desirable to
store fewer basis vectors, and intuitively, less orthogonality will be lost with fewer Lanczos iterations.
As was noted in the discussion of Algorithm 4, the block Lanczos algorithm run for only 2 iterations
is equivalent to a full Gram-Schmidt process when refinement is used. Setting $p = s = 2$ and using
Algorithm 4 will result in no more loss of orthogonality than the random projection method.

Algorithm 5  Hybrid random projection-Krylov subspace approach
Require: a priori chosen random start block $X_0$ with linearly independent columns, power factor $p$
1: QR factorize $X_1R = A^pX_0$
2: return $Q, T$ from 2 iterations of Algorithm 4 or 2 with start block $X_1$.
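A self-contained sketch of the hybrid scheme for the case of two Lanczos iterations is given below. The function name `hybrid_rp_krylov` is hypothetical, $A$ is assumed real symmetric, and the start block is re-orthonormalized after every product with $A$ (an assumption made for numerical safety; it spans the same subspace as $A^pX_0$ in exact arithmetic).

```python
import numpy as np

def hybrid_rp_krylov(A, X0, p=2):
    """Random-projection start block followed by two refined block Lanczos steps (sketch)."""
    # Step 1: orthonormal basis X1 of span{A^p X0}.
    X1, _ = np.linalg.qr(X0)
    for _ in range(p):
        X1, _ = np.linalg.qr(A @ X1)
    # Step 2: two block Lanczos iterations with refinement, starting from X1.
    R = A @ X1
    A1 = X1.T @ R                    # first diagonal block
    R -= X1 @ A1                     # classical Gram-Schmidt step
    R -= X1 @ (X1.T @ R)             # refinement step
    Q2, B2 = np.linalg.qr(R)         # second Lanczos block and off-diagonal block
    R = A @ Q2 - X1 @ B2.T
    A2 = Q2.T @ R                    # second diagonal block
    Q = np.hstack([X1, Q2])
    T = np.block([[A1, B2.T], [B2, A2]])
    return Q, T
```

With a start block of $b$ columns this returns a $2b$-dimensional basis, so a $k$-dimensional approximation corresponds to $b = k/2$, in line with the choice $s = 2$.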

We conclude our discussion of all block Lanczos methods by noting that our complexity analysis
has assumed that sparse matrix-vector products are of negligible cost compared to the dense matrix
operations in Algorithms 3 and 2. This is not always the case; if $A$ is dense then forming $AQ_j$ will be
more expensive than QR factorizing the block vector $AQ_j$. Also, if $A$ is implicitly defined
in terms of a rectangular $n \times m$ matrix $B$ with $m \gg n$, then the product $AQ_j$ may be more expensive than the
QR factorization of $AQ_j$ even when $B$ is sparse. If the costs of sparse matrix-vector products are the
dominant compute costs and no substantial computational advantages are realized for dense matrix
operations over sparse matrix-vector operations, 3 iterations of block Lanczos with a 50% block size
may still be less computationally expensive than random projections to produce a $k$-dimensional
approximation. The block Lanczos method simply requires fewer sparse matrix-vector products than
the random projection method.

3.3  Convergence  Analyses 

We have observed that shrinking the block size can lead to Lanczos iteration that is faster than
the random projection algorithm. This speed advantage alone is insufficient to recommend Lanczos
iteration with a shrunken block over the random projection algorithm. Rather, the complexity
observations must be combined with error bounding analyses on both Algorithm 3 and Algorithm 2
to determine those cases for which block Lanczos with a smaller block will produce better low-rank
approximations than Algorithm 3.

Bounds for low-rank matrix approximation error exist for Algorithm 3 and for eigenvalue
approximations from Krylov subspaces. These existing bounds are not useful in combination with each other
to determine cases for which the shrink-and-iterate approach will produce a low-rank approximation
with lower error than the random projection method. Both the bounds on the random projection
error and the eigenvalue approximation error bounds for Krylov subspaces produce lower bounds,
and for comparison of the two algorithms we require upper bounds for one and lower bounds for the other.
Moreover, existing Krylov subspace bounds were originally intended to be elegant expressions for the
asymptotic worst-case error. Their derivation uses simplifications that drastically improve clarity at
the expense of tightness. In some cases, they are rather pessimistic; we have observed that they
are so pessimistic for short Krylov sequences as to not be useful for comparison with the random
projection method.

Due to these limitations of existing bounds, we derive results that allow for direct comparison of
random projections and the shrink-and-iterate application of Krylov subspaces. These results give
exact formulations of the norm of low-rank approximations for the random projection method and
the shrink-and-iterate approach, so that tightness is not an issue. These formulations discard the
asymptotic form of previous bounds of eigenvalue approximations from Krylov subspaces to exploit
the shortness of the Krylov sequence in the shrink-and-iterate approach. In particular, we aim
to show cases for which a Krylov sequence of length 2 and a start block of size $b$ will produce a
better low-rank approximation than random projections with a start block of size $2b$. Before deriving
these new expressions, we review existing bounds on random projection errors from Halko et al. [36]
and on eigenvalue approximations from Krylov subspaces.

3.3.1  Review  of  existing  bounds  for  random  projections  and  Krylov  subspaces 

We first review the available bounds for low-rank approximations for random projections and Krylov
subspaces. For random projections, we have an elegant result due to Halko et al.

Theorem 5 (Halko, Martinsson and Tropp [36]). Let $A \in \mathbb{R}^{m \times n}$ be the input matrix with partitioned
singular value decomposition

(3.7)  $A = [U_1\ U_2] \begin{bmatrix} \Sigma_1 & \\ & \Sigma_2 \end{bmatrix} [V_1\ V_2]^T$

with $U_1$, $\Sigma_1$ and $V_1$ having $k$ columns. Then the error in low-rank approximation using the
subspace generated by Algorithm 3 operating on $AA^T$ is bounded as

(3.8)  $\|A - A^{(k)}\| \le \|\Sigma_2\| + \|\Sigma_2\Omega_2\Omega_1^{\dagger}\|$

with $\Omega_1 = U_1^T\Omega$, $\Omega_2 = U_2^T\Omega$ and random sample matrix $\Omega$, for $\|\cdot\|$ being either the spectral or
Frobenius norm.

Eigenvalue approximation bounds exist in many forms for Krylov subspace approximations.
These eigenvalue approximation results may be used to construct low-rank approximation bounds
for $\|A - A^{(k)}\|$ indirectly for cases which produce $A^{(k)}$ by restricting $A$ to a subspace. From
Theorem 4, we have $\|A - A^{(k)}\|_F^2 = \operatorname{tr}(AA^T) - \operatorname{tr}(A^{(k)}A^{(k)T})$. When $A$ is square and positive semidefinite,
we have $\|A - A^{(k)}\|_* = \operatorname{tr}(A - A^{(k)}) = \operatorname{tr}(A) - \operatorname{tr}(A^{(k)})$. In both cases, we can infer the improvement in
low-rank approximation error directly from the improvement in the norm of $A^{(k)}$; therefore convergence
of eigenvalue approximations indicates low-rank approximation error. The following theorem
is a generalization of Theorem 3 to convergence of singular values of $A$ in the spaces $\mathcal{K}_j(AA^T, x_0)$
and $\mathcal{K}_j(A^TA, Ax_0)$.
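The Frobenius-norm identity quoted from Theorem 4 can be checked numerically. The sketch below assumes the approximation is obtained by orthogonal projection onto the span of an orthonormal basis $Q$, so that the restricted matrix is $Q^TA$ as in (3.1) with $V = Q$; the function name `projection_error_identity` is hypothetical.

```python
import numpy as np

def projection_error_identity(A, Q):
    """Return both sides of ||A - Q Q^T A||_F^2 = tr(A A^T) - tr(A_k A_k^T) (sketch)."""
    Ak = Q.T @ A                                    # restricted matrix, as in (3.1) with V = Q
    lhs = np.linalg.norm(A - Q @ Ak, "fro") ** 2    # squared Frobenius error of the projection
    rhs = np.trace(A @ A.T) - np.trace(Ak @ Ak.T)   # difference of traces
    return lhs, rhs
```

The two returned values should agree up to round-off for any matrix `A` and any `Q` with orthonormal columns, which is why a larger norm of the restricted matrix translates directly into a smaller approximation error.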
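Theorem 6 (Golub, Luk and Overton [31]). Let $A \in \mathbb{R}^{m \times n}$ be the input matrix with singular values
$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min}$. Consider the matrix $AA^T$ with spectral decomposition $AA^T = U\Lambda U^T$ and
eigenvalues $\lambda_1 = \sigma_1^2 \ge \lambda_2 = \sigma_2^2 \ge \cdots \ge \lambda_{\min} = \sigma_{\min}^2$ with corresponding eigenvectors $u_1, u_2, \ldots, u_{\min}$.
Apply $s$ steps of the block Lanczos algorithm in Algorithm 2 with block size $b$ to generate $\mathcal{K}_s(AA^T, X_0)$
and the projection $T_{s,s}$ of $AA^T$ onto $\mathcal{K}_s(AA^T, X_0)$. Use $\lambda_i^{(s)}$ to denote the $i$th eigenvalue of $T_{s,s}$.
Then for $i = 1, 2, \ldots, b$ the eigenvalue estimate $(\sigma_i^{(s)})^2 = \lambda_i^{(s)}$ is bounded as

(3.9)  $0 \le \sigma_i^2 - (\sigma_i^{(s)})^2 \le \dfrac{(\sigma_i^2 - \sigma_{\min}^2)\tan^2\theta}{T_{s-1}^2\left(\frac{1+\gamma_i}{1-\gamma_i}\right)}$

for

(3.10)  $\gamma_i = \dfrac{\sigma_i^2 - \sigma_{b+1}^2}{\sigma_i^2 - \sigma_{\min}^2}$

and $\theta = \cos^{-1}\tau$, where $\tau$ is the smallest singular value of $[u_1\ u_2\ \ldots\ u_b]^TX_0$ and $T_{s-1}$ is the
Chebyshev polynomial of the first kind of degree $s - 1$.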


A nearly identical result is also available for the trailing singular values of $A$ [17]. It bounds
errors of approximations of small eigenvalues; it therefore gives upper bounds on the trailing $b$
eigenvalue approximations from the Krylov subspace. Simply using these bounds on $\mathcal{K}_2(A, X_0)$ therefore
does not result in bounds which may be used in conjunction with an upper bound on $\|A_{rp}^{(k)}\|$, where
$A_{rp}^{(k)}$ is the rank-$k$ approximation produced by random projections, to determine which method will
produce a smaller error. Another serious limitation of these bounds for comparing the shrink-and-iterate
approach to random projections is that they do not provide bounds for all $sb$ eigenvalue
approximations from $\mathcal{K}_s(A, X_0)$; bounds on all $sb$ eigenvalue approximations are required for comparison
against $\|A_{rp}^{(k)}\|$. The limitation on bounding interior eigenvalues is an inevitable consequence of the
methods used to produce the bounds, and bounding interior eigenvalues beyond $s$ using the same
methods is rather difficult.


3.3.2  Derivation  of  new  bounds 

Unfortunately, neither the bounds in Theorem 5 nor those in Theorem 6 are useful for comparing the block
Lanczos and random projection methods. The bound in (3.8) gives upper bounds for the error, while
we require lower bounds to determine when block Lanczos will perform better. Although we do
require lower bounds on the eigenvalue estimates for block Lanczos as given by (3.9),
the bounds only provide for the first $b$ eigenvalue estimates. We require bounds on all $k$ eigenvalue
estimates, not just the leading $b$ of them. Nevertheless, the Halko bounds do present us with a useful
starting point to derive the necessary devices to compare random projections and Krylov subspaces.

We begin with a formula that characterizes exactly $\|A_{rp}^{(k)}\|$ for the random projection method.
We note that this expression may be combined with the Kaniel-Paige-Saad bounds to produce an
expression for predicting which method will produce a better low-rank approximation error, but
the pessimism in the Kaniel-Paige-Saad bounds limits its usefulness. However, the expression for
$\|A_{rp}^{(k)}\|$ not only gives an expression for the value of random projections, but also may be applied
to generate a formula for part of $\|A^{(k)}\|$. With that, we then present an extension that gives an
expression for all of $\|A^{(k)}\|$, by using the identities $\|A\|_F^2 = \operatorname{tr}(A^TA)$ and $\operatorname{tr}(A) = \sum_i \lambda_i(A)$ with a block
decomposition of $A^TA$. The bounds for random projections can be used for the upper-left-hand block
of $A$ restricted to $\mathcal{K}_2(A, A^2X_0)$, and the extension gives the trace of the lower-right-hand block of
$A$ restricted to $\mathcal{K}_2(A, A^2X_0)$. As $\mathcal{K}_2(A, A^2X_0) \subset \mathcal{K}_4(A, X_0)$, the eigenvalue approximations from $A$
restricted to $\mathcal{K}_4(A, X_0)$ are at least as good as the approximations from $A$ restricted to $\mathcal{K}_2(A, A^2X_0)$. We
begin with proof of a new theorem characterizing the error of random projections or one iteration of
the block Lanczos process.

Theorem 7. Let $A$ be given as in Theorem 5. For a random sampling matrix $\Omega \in \mathbb{R}^{m \times k}$ with QR
factorization $\Omega = X_0S$, let $A^{(k)}$ be the low-rank approximation obtained by Algorithm 3 with input
$AA^T$. Let $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min}$ be the singular values of $A$ with corresponding left singular vectors
$u_1, u_2, \ldots, u_{\min}$. Then $\|A^{(k)}\|_F^2$ is given by

(3.11)  $\|A^{(k)}\|_F^2 = \sum_{i=1}^{\min\{m,n\}} \sigma_i^{4p+2} \|(\Sigma^{2p}U^T\Omega)^{-T}\Omega^Tu_i\|_2^2$

To prove this theorem, we first introduce and prove the following lemma.

Lemma 1. Let $A$ be an arbitrary real matrix and $AA^T = U\Sigma^2U^T$ be the spectral decomposition of the positive semi-definite Gram matrix $AA^T$. Then for any eigenvector $u_i$, the norm of the image $\tilde u_i$ of $u_i$ in the space $S = \mathrm{span}\{(AA^T)^p\Omega\}$ is given by

(3.12)  $\|\tilde u_i\|_2 = \sigma_i^{2p}\, \|(\Sigma^{2p}U^T\Omega)^{-T}\Omega^Tu_i\|_2.$

Proof. Let $QR = (AA^T)^p\Omega$ be the QR factorization of $(AA^T)^p\Omega$. Then $Q$ is an orthonormal basis for $\mathrm{span}\{(AA^T)^p\Omega\} = S$ and

(3.13)  $Q = (AA^T)^p\Omega R^{-1}$

as $R$ is nonsingular, provided that no null eigenvector is in $\mathrm{span}\{\Omega\}$. Then

(3.14)  $\|\tilde u_i\|_2 = \|((AA^T)^p\Omega R^{-1})^Tu_i\|_2 = \|R^{-T}\Omega^T(AA^T)^pu_i\|_2 = \sigma_i^{2p}\, \|R^{-T}\Omega^Tu_i\|_2.$

We have $QR = (AA^T)^p\Omega$, so

(3.15)  $\sigma(QR) = \sigma(U\Sigma^{2p}U^T\Omega) = \sigma(\Sigma^{2p}U^T\Omega)$

where $\sigma(A)$ represents the singular values of $A$, as singular values are invariant under the orthonormal rotations $Q$ and $U$. Combining (3.14) and (3.15) we get

(3.16)  $\|\tilde u_i\|_2 = \sigma_i^{2p}\, \|(\Sigma^{2p}U^T\Omega)^{-T}\Omega^Tu_i\|_2$

which completes the proof. □

This lemma gives an a priori formula for the norms of the projections of eigenvectors of $AA^T$ in the space $\mathrm{span}\{(AA^T)^p\Omega\}$. With that, we can bound the norm of $A^{(b)}$.
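The identity behind Lemma 1 can be checked numerically. The following is a minimal, dependency-free sketch under stated assumptions: $A$ is taken to be diagonal (so $AA^T = \mathrm{diag}(\sigma_i^2)$ and the eigenvectors are the unit vectors $e_i$), the singular values, sizes, and random seed are illustrative choices, and the helper names `gram_schmidt_qr` and `solve_r_transposed` are hypothetical. It compares the directly computed image norm $\|Q^Tu_i\|_2$ with $\sigma_i^{2p}\|R^{-T}\Omega^Tu_i\|_2$, the form that appears in the proof.

```python
import math
import random

def gram_schmidt_qr(K):
    """Thin QR of a tall matrix K (list of rows) by modified Gram-Schmidt."""
    m, b = len(K), len(K[0])
    Q = [row[:] for row in K]
    R = [[0.0] * b for _ in range(b)]
    for j in range(b):
        for k in range(j):
            R[k][j] = sum(Q[i][k] * Q[i][j] for i in range(m))
            for i in range(m):
                Q[i][j] -= R[k][j] * Q[i][k]
        R[j][j] = math.sqrt(sum(Q[i][j] ** 2 for i in range(m)))
        for i in range(m):
            Q[i][j] /= R[j][j]
    return Q, R

def solve_r_transposed(R, y):
    """Solve R^T z = y by forward substitution (R is upper triangular)."""
    z = [0.0] * len(R)
    for j in range(len(R)):
        z[j] = (y[j] - sum(R[k][j] * z[k] for k in range(j))) / R[j][j]
    return z

random.seed(0)
m, b, p = 8, 3, 2                                    # illustrative sizes (assumed)
sigma = [2.0, 1.7, 1.3, 1.0, 0.8, 0.5, 0.3, 0.1]     # singular values of diagonal A
Omega = [[random.gauss(0.0, 1.0) for _ in range(b)] for _ in range(m)]

# (AA^T)^p Omega: row i of Omega is scaled by sigma_i^(2p) because AA^T is diagonal.
K = [[sigma[i] ** (2 * p) * Omega[i][j] for j in range(b)] for i in range(m)]
Q, R = gram_schmidt_qr(K)

# Direct image norm ||Q^T e_i||_2 is the norm of row i of Q.
direct = [math.sqrt(sum(q * q for q in Q[i])) for i in range(m)]

# Formula: sigma_i^(2p) * ||R^{-T} Omega^T e_i||_2, where Omega^T e_i is row i of Omega.
formula = []
for i in range(m):
    z = solve_r_transposed(R, Omega[i])
    formula.append(sigma[i] ** (2 * p) * math.sqrt(sum(v * v for v in z)))

for d, f in zip(direct, formula):
    print(f"{d:.12f}  {f:.12f}")
```

The two columns should agree to rounding error; in practice a numerical library would replace the hand-written QR and triangular solve.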

Proof of Theorem 7. We use $\lambda_i(A)$ to represent the $i$-th eigenvalue of $A$. By definition, $\|A\|_F^2 = \mathrm{tr}(A^TA) = \sum_i \lambda_i(A^TA)$ for any $A$. Also, $AA^T$ and $A^TA$ share the same nonzero eigenvalues, so $\|A\|_F^2 = \mathrm{tr}(A^TA) = \mathrm{tr}(AA^T)$. From the definition of $A^{(b)}$ in (3.1), we have

(3.17)  $A^{(b)} = Q^TA$

for any orthonormal basis $Q$ of $\mathrm{span}\{(AA^T)^p\Omega\}$. Then

(3.18)  $\|A^{(b)}\|_F^2 = \mathrm{tr}(A^TQQ^TA) = \mathrm{tr}(V\Sigma^TU^TQQ^TU\Sigma V^T)$

where $A = U\Sigma V^T$ is the singular value decomposition of $A$. With the similarity transform $V^T \cdot V$,

(3.19)  $\|A^{(b)}\|_F^2 = \mathrm{tr}(\Sigma U^TQQ^TU\Sigma).$


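Without loss of generality, we order the diagonal entries of $\Sigma$ and columns of $U$ such that $\sigma_{11} \ge \sigma_{22} \ge \cdots \ge \sigma_{\inf}$ and $U = [u_1\ u_2\ \cdots\ u_{\inf}]$. Let $\tilde U = Q^TU$. As $Q$ is a basis for $\mathrm{span}\{(AA^T)^p\Omega\}$, we have

(3.20)  $\|\tilde u_i\|_2 = \sigma_i^{2p}\, \|(\Sigma^{2p}U^T\Omega)^{-T}\Omega^Tu_i\|_2$

from Lemma 1. Substituting $\tilde U$ into (3.19), we get

(3.21)  $\|A^{(b)}\|_F^2 = \mathrm{tr}(\Sigma\tilde U^T\tilde U\Sigma) = \sum_i \sigma_i^2\, \tilde u_i^T\tilde u_i = \sum_i \sigma_i^2\, \|\tilde u_i\|_2^2 = \sum_i \sigma_i^2 \left(\sigma_i^{2p}\, \|(\Sigma^{2p}U^T\Omega)^{-T}\Omega^Tu_i\|_2\right)^2$

which completes the proof. □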

Theorem 7 is similar to Theorem 5 from Halko et al. [36] but gives an exact formulation of the Frobenius norm of $A^{(b)}$. The theorem gives a more direct means of comparing the Frobenius norms produced by the random projection method and by block Krylov subspaces for low-rank approximation.
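As a worked special case (an illustration under the added assumption $b = 1$, not a result stated in the source), let the sampling matrix be a single vector $\omega$, so that the triangular factor $R$ of Lemma 1 reduces to the scalar $r = \|(AA^T)^p\omega\|_2$. Lemma 1 and the trace argument above then give the familiar Rayleigh quotient of $AA^T$ at the power iterate:

```latex
% Assumption: b = 1, so \Omega = \omega is a vector and R = r is a scalar.
r^2 = \|(AA^T)^p\,\omega\|_2^2 = \sum_i \sigma_i^{4p}\,(u_i^T\omega)^2,
\qquad
\|\tilde u_i\|_2 = \frac{\sigma_i^{2p}\,|u_i^T\omega|}{r},
\\
\|A^{(1)}\|_F^2 = \sum_i \sigma_i^2\,\|\tilde u_i\|_2^2
  = \frac{\sum_i \sigma_i^{4p+2}\,(u_i^T\omega)^2}{\sum_i \sigma_i^{4p}\,(u_i^T\omega)^2}.
```

Increasing $p$ shifts the weights $\sigma_i^{4p}(u_i^T\omega)^2$ toward the leading singular values, which is how the power parameter improves the approximation norm.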

With Theorem 7, we may begin drawing conclusions regarding matrices for which a shrink-and-iterate scheme will produce better low-rank approximations than the random projection method. Combination of (3.11) and the lower bounds on eigenvalue approximations in (3.9) yields the following proposition.

Proposition  1.  Let  A  be  a  positive  semi-definite  matrix,  Xq  be  a  real  matrix  with  2b  orthonormal 
columns  and  AXq  defined,  let  A^p  be  the  low-rank  approximation  computed  by  the  random  projec¬ 
tions  method  in  Algorithm  3,  let  A^^  he  the  low-rank  approximation  generated  by  k  iterations  of  the 
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block  Lanczos  method  described  in  Algorithm  2  with  inputs  A  =  A  and  Qi  =  Xs  where  Xg  is  composed 
of  some  arbitrary  b  columns  of  Xq,  let  ji  defined  as  in  Theorem  6,  let  6  =  cos“^minsmin(ji„f 
where  S  is  a  subset  of  columns  of  Xq  with  |S|  =  6  and  Xg  =  [xj  xj  ...]  for  x;  e  S,  Xj  e  S.  Then  the 
squared  Frobenius  norm  of  is  strictly  less  than  the  squared  Frobenius  norm  of  A^^  whenever 


(3.22) 


II^RpIll  =  E  O^^WiJ^U^Xor^Xl  Ui  112 

i=l  i=l 


taird 


Tk-. 


.l-Ti 


<  iiaJ>ii 


We remark that

• when $A$ is rectangular, we may substitute the positive semi-definite Gram matrix $AA^T$,

• for s = 4 and k = b the Krylov subspace generated in this proposition is the same Krylov subspace generated by the "shrink-and-iterate" method,

• this result may be valid in theory, but the pessimistic nature of the Kaniel-Paige-Saad bounds from Theorem 6 limits their usefulness. In fact, we have observed no case in which the bounds are tight enough to predict that the shrink-and-iterate method will produce a better approximation, even when it outperforms the random projections method by a substantial margin.


For further tightness we consider an alternative to the Kaniel-Paige-Saad bounds in the right-hand side of (3.22). These new bounds will not use the polynomial methods used in the Kaniel-Paige-Saad bounds, but will instead use a more direct approach with the assumption that the Krylov sequence length is at most 2. This assumption sacrifices generality in the number of iterations in exchange for a formula that, when combined with (3.11), gives the value of the norm of the low-rank approximation. We intend to use this formula to give the value of that norm for $\mathcal{K}_2(A, A^pX_0)$, which in turn bounds the norm for $\mathcal{K}_i(A, X_0)$.

The Kaniel-Paige-Saad bounds provide elegant expressions that overcome the recurrence relations that would otherwise be necessary to formulate a priori bounds on eigenvalue approximations in Krylov subspaces. The bounds are applicable to Krylov subspaces of any dimension. When one is considering bounds for Krylov subspaces of arbitrary length, the direct expressions for eigenvalue approximations would be prohibitively cumbersome or inaccurate. However, for small Krylov sequence length $i$, reasonable bounds may be generated directly. To that end, we present the following proposition.


Proposition 2. Let $A$ be a positive semi-definite matrix, and $X_0$ be a matrix with orthonormal columns and $AX_0$ defined. Consider the Krylov subspace $\mathcal{K}_2(A, A^pX_0)$; then the projection of $A$ into $\mathcal{K}_2(A, A^pX_0)$ is

(3.23)  $T_{2,2} = \begin{bmatrix} A_1 & B_2^T \\ B_2 & A_2 \end{bmatrix}$

where $A_1$, $A_2$, and $B_2$ are the matrices produced by the block Lanczos algorithm.


We note that

• the block $A_1$ is the low-rank approximation generated by the random projection method in Algorithm 3 with inputs $A = A$ and $\Omega = A^pX_0$,

• the trace $\mathrm{tr}(T_{2,2}) = \mathrm{tr}(A_1) + \mathrm{tr}(A_2)$,

• if $A$ is rectangular, and the block Lanczos method operates on $AA^T$, then the squared Frobenius norm of the resulting low-rank approximation equals $\mathrm{tr}(T_{2,2})$.
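The structure of the projected matrix can be illustrated with a small sketch. The assumptions here are illustrative and not from the source: block size 1 (so $A_1$, $A_2$, $B_2$ are scalars), $p = 0$, a diagonal positive semi-definite test matrix, and a hypothetical helper `lanczos_two_steps`. It runs two Lanczos steps and confirms that the first diagonal entry is the Rayleigh quotient of the start vector and that the off-diagonal entry of the projected matrix equals $B_2$.

```python
import math

def lanczos_two_steps(diag, x0):
    """Two single-vector Lanczos steps on A = diag(diag), started from x0."""
    n = len(diag)
    norm0 = math.sqrt(sum(v * v for v in x0))
    q1 = [v / norm0 for v in x0]
    Aq1 = [diag[i] * q1[i] for i in range(n)]
    a1 = sum(q1[i] * Aq1[i] for i in range(n))         # A_1 = q1^T A q1
    resid = [Aq1[i] - a1 * q1[i] for i in range(n)]    # A q1 - q1 A_1
    b2 = math.sqrt(sum(v * v for v in resid))          # B_2
    q2 = [v / b2 for v in resid]
    a2 = sum(diag[i] * q2[i] ** 2 for i in range(n))   # A_2 = q2^T A q2
    return a1, b2, a2, q1, q2

diag = [5.0, 3.0, 2.0, 1.0, 0.5]     # eigenvalues of the assumed test matrix
x0 = [1.0] * len(diag)               # assumed start vector

a1, b2, a2, q1, q2 = lanczos_two_steps(diag, x0)
T = [[a1, b2], [b2, a2]]             # scalar analogue of the 2x2 block matrix

# Entries of V^T A V computed directly from the basis V = [q1 q2].
off_diagonal = sum(q1[i] * diag[i] * q2[i] for i in range(len(diag)))
overlap = sum(q1[i] * q2[i] for i in range(len(diag)))
trace_T = T[0][0] + T[1][1]
print(T, off_diagonal, overlap, trace_T)
```

With a uniform start vector the first diagonal entry is simply the mean of the eigenvalues, which is the value one step of projection onto the start vector would produce.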

To produce bounds on nuclear norm or squared Frobenius norm error, we simply require a corresponding formula for $\mathrm{tr}(A_2)$ and a provision that eigenvalue approximations from $\mathcal{K}_i(A, A^pX_0)$ are never better than eigenvalue approximations from $\mathcal{K}_{i+p}(A, X_0)$. Therefore, we require tools to reason about how eigenvalue approximations from $\mathcal{K}_j(A, X_0)$ compare against eigenvalue approximations from $\mathcal{K}_{j-1}(A, AX_0)$. We introduce and prove the following lemma to that end.

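Lemma 2. Let $A$ be a symmetric positive semi-definite matrix, let $\lambda_i^{(j)}$ be the $i$-th eigenvalue estimate from $A$ restricted to $\mathcal{K}_j(A, X_0)$, and let $\lambda_i^{(j-1)}$ be the $i$-th eigenvalue approximation of $A$ restricted to $\mathcal{K}_{j-1}(A, X_0)$. Then

(3.24)  $\lambda_i^{(j)} \ge \lambda_i^{(j-1)}.$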

Proof. Suppose, for contradiction, otherwise. Let $v_i$ be Ritz vectors of $A$ with $v_i \in \mathcal{K}_j(A, X_0)$ and $u_i$ be Ritz vectors of $A$ with $u_i \in \mathcal{K}_{j-1}(A, X_0)$. Since $\mathcal{K}_{j-1}(A, X_0) \subset \mathcal{K}_j(A, X_0)$ and the Ritz vectors $v_i$ form an orthonormal basis for $\mathcal{K}_j(A, X_0)$, any Ritz vector $u_i \in \mathcal{K}_{j-1}(A, X_0)$ may be written as a linear combination of the $v_l$:

(3.25)  $u_i = \sum_{l=1}^{j} c_l v_l.$

Let $V = [v_1\ \cdots\ v_j]$. Then $V^TAV$ is the restriction of $A$ to $\mathcal{K}_j(A, X_0)$, and $V^TAV = A^{(j)}$. Since $u_i$ may be expressed in terms of the vectors $v_l$, $VV^Tu_i = u_i$. Let $U = [u_1\ \cdots\ u_{j-1}]$ and $\tilde U = V^TU$. Then we may write Ritz values as the diagonal of

(3.26)  $\mathrm{diag}(U^TAU) = \mathrm{diag}(U^TVV^TAVV^TU) = \mathrm{diag}(\tilde U^TA^{(j)}\tilde U).$

Since the matrix $A^{(j)}$ is also diagonal, its eigenvalues are simply its diagonal entries and its eigenvectors are the element vectors $e_i$. Assume without loss of generality that the diagonal entries of $A^{(j)}$ are ordered such that $e_1^TA^{(j)}e_1 \ge e_2^TA^{(j)}e_2 \ge \cdots \ge e_j^TA^{(j)}e_j$. Then, by the Courant characterization of eigenvectors, there is no other set of orthonormal vectors $\{x_i \mid 1 \le i \le j\}$ with $x_1^TA^{(j)}x_1 \ge x_2^TA^{(j)}x_2 \ge \cdots \ge x_j^TA^{(j)}x_j$ such that $x_i^TA^{(j)}x_i > e_i^TA^{(j)}e_i$. But from our assumption, at least one Ritz value had $\lambda_i^{(j-1)} > \lambda_i^{(j)}$, which implies that at least one Ritz vector has $\tilde u_i^TA^{(j)}\tilde u_i > e_i^TA^{(j)}e_i$, which is a contradiction as the Ritz vectors $\tilde u_i$ are orthonormal.

□
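The smallest instance of Lemma 2 can be written out explicitly. As an illustration (assuming the single-vector case, with scalars $\alpha_1$, $\alpha_2$, $\beta_2$ standing in for the blocks produced by two Lanczos steps; these symbols are introduced here and are not the source's notation), the only Ritz value from the one-dimensional subspace is $\alpha_1$, while the leading Ritz value from the two-dimensional subspace is the larger eigenvalue of the $2 \times 2$ projected matrix:

```latex
% Assumption: single-vector Lanczos, two steps.
T_{2,2} = \begin{bmatrix} \alpha_1 & \beta_2 \\ \beta_2 & \alpha_2 \end{bmatrix},
\qquad
\lambda_{\pm} = \frac{\alpha_1 + \alpha_2}{2}
  \pm \sqrt{\left(\frac{\alpha_1 - \alpha_2}{2}\right)^{2} + \beta_2^{2}},
\\
\lambda_{+} \;\ge\; \frac{\alpha_1 + \alpha_2}{2} + \frac{|\alpha_1 - \alpha_2|}{2}
  \;=\; \max\{\alpha_1, \alpha_2\} \;\ge\; \alpha_1 .
```

Enlarging the subspace therefore cannot decrease the leading eigenvalue estimate, with equality only when $\beta_2 = 0$ and $\alpha_1 \ge \alpha_2$.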

Now that we have shown that the eigenvalue approximations from $A$ restricted to $\mathcal{K}_4(A, X_0)$ are at least as good as those from $A$ restricted to $\mathcal{K}_2(A, A^pX_0)$, we may use the expression from Theorem 7 to quantify the trace of $A_1$ from (3.23). Now we develop an expression for $\mathrm{tr}(A_2)$, which is simply $\mathrm{tr}(Q_2^TAA^TQ_2)$. We use a strategy similar to that of Theorem 7: first we develop a formula for the image $\tilde u_i$ of a left singular vector $u_i$ of $A$ in $\mathrm{span}\{Q_2\}$, and then use that to generate a formula for $\|Q_2^TA\|_F^2$. We begin by presenting and proving the following lemma.

Lemma 3. Let $A$ be an arbitrary matrix with singular value decomposition $A = U\Sigma V^T$ and singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\inf}$, and $X_0$ be an orthonormal $n \times b$ matrix. Let $Q_1R = (AA^T)^pX_0$ and $Q_2$ be the orthonormal matrices generated by 2 iterations of the block Lanczos algorithm presented in Algorithm 2 applied to $AA^T$. Then the squared norm $\|\tilde u_i\|_2^2$ of the image of a left singular vector $u_i$ of $A$ in $\mathrm{span}\{Q_2\}$ is

(3.27)  $\|\tilde u_i\|_2^2 = \|(\sigma_i^{2p+2}u_i^TX_0R^{-1} - \sigma_i^{2p}u_i^TX_0R^{-1}A_1)B_2^{-1}\|_2^2.$

Proof. Let $A$ and $X_0$ be given as described. Then $Q_1R = (AA^T)^pX_0$ and $Q_2$ is defined as

(3.28)  $Q_2B_2 = (AA^T)(AA^T)^pX_0R^{-1} - (AA^T)^pX_0R^{-1}A_1$

with $A_1 = (R^{-T}X_0^T(AA^T)^p)\,AA^T\,(AA^T)^pX_0R^{-1}$. Clearly

(3.29)  $((AA^T)^{p+1}X_0R^{-1} - (AA^T)^pX_0R^{-1}A_1)B_2^{-1} = Q_2$

and is an orthonormal basis for $\mathrm{span}\{Q_2\}$.

We require a formula for the norms of the images of left singular vectors of $A$ projected into $\mathrm{span}\{Q_2\}$. If $u_i$ is a left singular vector of $A$, then the image $\tilde u_i$ in $\mathrm{span}\{Q_2\}$ is

(3.30)  $u_i^T((AA^T)^{p+1}X_0R^{-1} - (AA^T)^pX_0R^{-1}A_1)B_2^{-1} = (\sigma_i^{2p+2}u_i^TX_0R^{-1} - \sigma_i^{2p}u_i^TX_0R^{-1}A_1)B_2^{-1},$

which completes the proof. □
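The formula of Lemma 3 can be checked in the simplest setting. The sketch below is illustrative only and assumes block size 1 (so $R$, $A_1$, $B_2$ are the scalars `r`, `a1`, `b2`), a diagonal $A$ whose left singular vectors are the unit vectors $e_i$, $p = 1$, and a uniform start vector; none of these choices come from the source. It compares the directly computed components of $Q_2$ with the closed-form expression.

```python
import math

sigma = [2.0, 1.5, 1.0, 0.6, 0.2]    # singular values of a diagonal A (assumed)
p = 1                                 # power parameter (assumed)
n = len(sigma)
x0 = [1.0 / math.sqrt(n)] * n         # unit-norm start vector (assumed)

# Q_1 R = (AA^T)^p x0 with AA^T = diag(sigma_i^2).
k = [sigma[i] ** (2 * p) * x0[i] for i in range(n)]
r = math.sqrt(sum(v * v for v in k))
q1 = [v / r for v in k]

# One Lanczos step on AA^T: Q_2 B_2 = (AA^T) Q_1 - Q_1 A_1.
a1 = sum(sigma[i] ** 2 * q1[i] ** 2 for i in range(n))
resid = [sigma[i] ** 2 * q1[i] - a1 * q1[i] for i in range(n)]
b2 = math.sqrt(sum(v * v for v in resid))
q2 = [v / b2 for v in resid]

# Direct image norm of e_i in span{q2} versus the closed-form expression.
direct = [abs(q2[i]) for i in range(n)]
formula = [
    abs((sigma[i] ** (2 * p + 2) * x0[i] / r
         - sigma[i] ** (2 * p) * x0[i] / r * a1) / b2)
    for i in range(n)
]
for d, f in zip(direct, formula):
    print(f"{d:.12f}  {f:.12f}")
```

The signed quantity inside the absolute value is the scalar analogue of an eigenpair residual: it changes sign where $\sigma_i^2$ crosses `a1`.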

We now prove a characterization of $\|Q_2^TA\|_F^2$.

Theorem 8. Let $A$ be an arbitrary matrix with singular value decomposition $A = U\Sigma V^T$ and singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\inf}$, and $X_0$ be an orthonormal $n \times b$ matrix. Let $Q_1R = (AA^T)^pX_0$ and $Q_2$ be the orthonormal matrices generated by 2 iterations of the block Lanczos algorithm presented in Algorithm 2 applied to $AA^T$, with $A_1$ and $B_2$ also from the block Lanczos algorithm. Then the squared Frobenius norm $\|A^TQ_2\|_F^2$ is given by

(3.31)  $\|A^TQ_2\|_F^2 = \sum_i \sigma_i^2\, \|(\sigma_i^{2p+2}\tilde u_i^TR^{-1} - \sigma_i^{2p}\tilde u_i^TR^{-1}A_1)B_2^{-1}\|_2^2$

with

$\tilde u_i = X_0^Tu_i$

for singular vectors $u_i$ of $A$,

$Q_2B_2 = (AA^T)^{p+1}X_0R^{-1} - (AA^T)^pX_0R^{-1}A_1$

and

$A_1 = ((AA^T)^pX_0R^{-1})^T(AA^T)(AA^T)^pX_0R^{-1}.$

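Proof. Since $\|Q_2^TA\|_F^2 = \mathrm{tr}(A^TQ_2Q_2^TA)$, we note that $A = U\Sigma V^T$ and by substitution,

(3.32)  $\|Q_2^TA\|_F^2 = \mathrm{tr}(A^TQ_2Q_2^TA) = \mathrm{tr}(V\Sigma U^TQ_2Q_2^TU\Sigma V^T).$

Since trace is invariant under orthonormal transformations,

(3.33)  $\mathrm{tr}(V\Sigma U^TQ_2Q_2^TU\Sigma V^T) = \mathrm{tr}(\Sigma U^TQ_2Q_2^TU\Sigma) = \sum_i \sigma_i^2\, \|Q_2^Tu_i\|_2^2.$

From Lemma 3 we have $\|Q_2^Tu_i\|_2 = \|(\sigma_i^{2p+2}u_i^TX_0R^{-1} - \sigma_i^{2p}u_i^TX_0R^{-1}A_1)B_2^{-1}\|_2$. Then

(3.34)  $\sum_i \sigma_i^2\, \|Q_2^Tu_i\|_2^2 = \sum_i \sigma_i^2\, \|(\sigma_i^{2p+2}u_i^TX_0R^{-1} - \sigma_i^{2p}u_i^TX_0R^{-1}A_1)B_2^{-1}\|_2^2$

follows, which completes the proof. □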

Combining  the  results  from  theorems  7  and  8  gives  a  sufficient  condition  for  the  shrink-and- 
iterate  approach  to  produce  a  better  low-rank  approximation  than  the  random  projection  method. 

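Proposition 3. Suppose $X_1$ and $X_2$ are random and real orthonormal blocks with $n$ rows and $b$ columns each. Let the random projection method be run on a real input matrix $A$ with start block $\Omega = [X_1\ X_2]$. The low-rank approximation of $A$ restricted to $\mathcal{K}_2(AA^T, (AA^T)^pX_1)$ will have larger norm than the low-rank approximation in $\mathrm{span}\{AA^T\Omega\}$ whenever

(3.35)  $\sum_i \sigma_i^{4p+2}\, \|(\Sigma^{2p}U^T\Omega)^{-T}\Omega^Tu_i\|_2^2 < \sum_i \sigma_i^{4p+2}\, \|(\Sigma^{2p}U^TX_0)^{-T}X_0^Tu_i\|_2^2 + \sum_i \sigma_i^2\, \|(\sigma_i^{2p+2}u_i^TX_0R^{-1} - \sigma_i^{2p}u_i^TX_0R^{-1}A_1)B_2^{-1}\|_2^2.$

Here the right-hand side uses the notation of Theorems 7 and 8 with start block $X_0 = X_1$.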

This proposition is a simple combination of Theorems 7 and 8.

3.3.3 Discussion

These new bounds lack the simplicity of the random projection bounds (3.8) or the Krylov subspace eigenvalue estimate bounds (3.9). Immediate insight into the spectral properties that discriminate between cases for which the shrink-and-iterate approach will outperform the random projections method is difficult. Therefore, we will examine the bounds in (3.35) more closely to derive insight into what spectral properties will lead to a better-performing shrink-and-iterate approach over the random projections method.

Assume  that  the  input  matrix  is  square  and  positive  semi-definite  (if  not  substitute  A  in  the 
following  discussion  with  AA^).  We  note  that  an  equivalent  expression  to  (3.11)  is  simply 

(3.36) 

i=l 

This  observation  leads  to  the  following  proposition  regarding  the  relationship  between  the  input 
matrix  A  and  R  in  (3.36). 

Proposition 4. Let $A$ be a positive semi-definite real matrix, and $X_0$ be a real matrix with $b$ linearly independent columns and the product $AX_0$ defined. Then $R^TR$ is the Cholesky factorization of $A^{2p}$ restricted to the space $\mathrm{span}\{X_0\}$.

This proposition may be derived from observing the QR factorization $QR = A^pX_0$ and

(3.37)  $R^TR = (QR)^TQR = X_0^TA^{2p}X_0.$

We also remark that since $R^TR = X_0^TA^{2p}X_0$,

• $R^TR$ is the result of the Rayleigh-Ritz procedure using the space $\mathrm{span}\{X_0\}$,

• singular values $\sigma_i(R)$ approximate eigenvalues $\lambda_i(A^p)$,

• singular values $\sigma_i(R^{-1})$ approximate eigenvalues $\lambda_i(A^{-p})$.
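The identity in (3.37) is easy to confirm numerically. The sketch below is a minimal illustration under assumptions that are not from the source: a diagonal positive semi-definite $A$ with the listed eigenvalues, $p = 2$, a small start block with linearly independent columns, and a hypothetical helper `triangular_factor` that returns $R$ from a Gram-Schmidt QR factorization.

```python
import math

def triangular_factor(K):
    """R from the thin QR factorization K = QR (modified Gram-Schmidt)."""
    m, b = len(K), len(K[0])
    Q = [row[:] for row in K]
    R = [[0.0] * b for _ in range(b)]
    for j in range(b):
        for k in range(j):
            R[k][j] = sum(Q[i][k] * Q[i][j] for i in range(m))
            for i in range(m):
                Q[i][j] -= R[k][j] * Q[i][k]
        R[j][j] = math.sqrt(sum(Q[i][j] ** 2 for i in range(m)))
        for i in range(m):
            Q[i][j] /= R[j][j]
    return R

lam = [3.0, 2.0, 1.0, 0.5]       # eigenvalues of the assumed diagonal PSD matrix A
p = 2                             # power parameter (assumed)
X0 = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, -1.0]]   # independent columns
n, b = len(lam), len(X0[0])

# QR = A^p X0; only the triangular factor R is needed here.
K = [[lam[i] ** p * X0[i][j] for j in range(b)] for i in range(n)]
R = triangular_factor(K)

# Left side R^T R and right side X0^T A^(2p) X0 of (3.37).
RtR = [[sum(R[k][a] * R[k][c] for k in range(b)) for c in range(b)]
       for a in range(b)]
gram = [[sum(lam[i] ** (2 * p) * X0[i][a] * X0[i][c] for i in range(n))
         for c in range(b)] for a in range(b)]
print(RtR)
print(gram)
```

Because $R$ is upper triangular with positive diagonal, it is the transposed Cholesky factor of the projected matrix, which is the content of the proposition.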

As the norm of low-rank approximations from the shrink-and-iterate approach depends on both $A_1$ and $A_2$ from (3.23), we also must reason about $\mathrm{tr}(A_2)$. From Lemma 3, we have a formula for the norm of the image of left singular vectors of $A$ in $\mathrm{span}\{Q_2\}$, which leads to the following proposition.

Proposition 5. Let $A$ be a positive semi-definite real matrix, let $X_0$ be a real matrix with $b$ orthonormal columns and $AX_0$ defined, let $Q_1R = A^pX_0$ be the QR factorization of $A^pX_0$, and let $A_1$ and $B_2$ be the matrices produced by one step of the block Lanczos process presented in Algorithm 2 run with inputs $A = A$ and $Q_1 = A^pX_0$. Let $A = U\Lambda U^T$ be the spectral decomposition of $A$ and $\tilde u_i$ be the image of eigenvector $i$ of $A$ in the space $\mathrm{span}\{A^pX_0\}$. Then the squared norm of the image of eigenvector $u_i$ of $A$ in the space $\mathrm{span}\{Q_2\}$, where $Q_2$ is from the block Lanczos algorithm, is given by

(3.38)  $\|Q_2^Tu_i\|_2^2 = \|(\sigma_i\tilde u_i^T - \tilde u_i^TA_1)B_2^{-1}\|_2^2.$

We remark that

• $\sigma_i\tilde u_i^T - \tilde u_i^TA_1$ in (3.38) is the residual of the eigenpair $(\sigma_i, \tilde u_i)$ when applied against the low-rank approximation $A^{(b)} = A_1$ ($b$ is from Proposition 4) of $A$ in the space $\mathrm{span}\{A^pX_0\}$,

• eigenpairs with both large and small magnitudes may have large residual, and the size of the residual is determined by how closely eigenvalues of $A_1$ approximate the eigenvalue $\sigma_i$ and how large the image of $u_i$ is in $\mathrm{span}\{A^pX_0\}$,

• if the space $\mathrm{span}\{(AA^T)^pX_0\}$ can exclude small singular triplets or eigenpairs well, then they will have small images $\|\tilde u_i\|$ and then (3.38) will be small regardless of $\sigma_i$.

Overall, there is a deep relationship apparent between the convergence of eigenvalues of $X_0^TA^pX_0$ or $X_0^TA^{2p}X_0$ for increasing $p$ and the approximation properties given in both Theorems 7 and 8. Cases for which the power method converges quickly (when leading eigenvalues of $AA^T$ are well-separated, especially compared to interior and trailing eigenvalues) are the cases for which the shrink-and-iterate approach will outperform the random projection method. When $A$ is Hermitian and positive semi-definite, the SVD and spectral decomposition coincide, and the above argument holds with $AA^T$ replaced with $A$.

Figure 3.2: Spectra of $A$ and $B$ with uniformly and exponentially distributed eigenvalues.

3.3.4  Convergence  example 

To better illustrate the relationship between the gaps in the input spectra and the relative performance of shrink-and-iterate and random projections, we present an example. We consider two input matrices, $A$ and $B$, both diagonal, where diagonal elements of $A$ are drawn from a uniform distribution $\mathcal{U}(1, 1.1)$ and diagonal elements of $B$ are drawn from an exponential distribution with scale parameter equal to 1. Both matrices share the same eigenspace; only the eigenvalues differ. The spectra of both matrices are shown in Figure 3.2. Based on the preceding discussion, we would expect $B$ to have spectral properties amenable to the shrink-and-iterate approach producing a better low-rank approximation, while random projections would produce a better low-rank approximation with $A$. This is indeed the case, and we examine the constituent sub-expressions in (3.35) to show how images of eigenvectors of $A$ and $B$ are changed by the power iteration process.

As the random projection method depends solely on the power parameter $p$ in $A^pX_0$ to produce a subspace that contains the leading eigenvectors of $A$, we examine the norms of the images $\tilde u_i$ of eigenvectors $u_i$ of $A$ and $B$. Since the spectrum of $B$ has much larger gaps between the leading eigenvalues, we expect those eigenvectors associated with leading eigenvalues to have images with larger norms in $\mathrm{span}\{BX_0\}$. Figure 3.3 shows the norms of images of eigenvectors in $\mathrm{span}\{AX_0\}$ and $\mathrm{span}\{BX_0\}$.
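The effect described above can be reproduced in a few lines. This is a simplified sketch rather than the experiment behind the figures: it assumes a single start vector with equal components instead of a random block (to isolate the influence of the spectrum), a matrix dimension of 200, and a fixed random seed, and the helper `image_norms` is hypothetical. For a diagonal matrix the image of eigenvector $e_i$ in the span of the single vector $Mx_0$ has norm $|\lambda_i x_{0,i}| / \|Mx_0\|_2$.

```python
import math
import random

random.seed(1)
n = 200                                                 # dimension (assumed)

# Spectra as in the example: uniform on (1, 1.1) and exponential with scale 1.
spec_A = sorted((random.uniform(1.0, 1.1) for _ in range(n)), reverse=True)
spec_B = sorted((random.expovariate(1.0) for _ in range(n)), reverse=True)

def image_norms(spectrum, x0):
    """Norm of the image of each eigenvector e_i in span{M x0}, M = diag(spectrum)."""
    y = [lam * x for lam, x in zip(spectrum, x0)]
    norm_y = math.sqrt(sum(v * v for v in y))
    return [abs(v) / norm_y for v in y]

x0 = [1.0 / math.sqrt(n)] * n                           # uniform start vector (assumed)
img_A = image_norms(spec_A, x0)
img_B = image_norms(spec_B, x0)

print("leading eigenvector image, A:", img_A[0])
print("leading eigenvector image, B:", img_B[0])
print("trailing eigenvector image, A:", img_A[-1])
print("trailing eigenvector image, B:", img_B[-1])
```

For the tightly clustered spectrum every eigenvector receives nearly the same weight, roughly $1/\sqrt{n}$, whereas the exponential spectrum concentrates the image on the leading eigenvectors and nearly excludes the trailing ones.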

Figure 3.3: Norms of images of eigenvectors in $\mathrm{span}\{AX_0\}$ and $\mathrm{span}\{BX_0\}$. Eigenvectors with large eigenvalues have larger images in the space of $B$.

Norms of the images of eigenvectors in $\mathcal{K}_2(A, A^pX_1)$ (with $X_1$ being composed of the first $b/2$ columns of $X_0$) are more complicated, but also depend largely on the "invariance" of the space $A^{p+1}X_1R^{-1} - A^pX_1R^{-1}A_1$. When eigenvalues are tightly clustered, these matrices will have much smaller norms than if leading eigenvalues are well-separated. This may be deduced from the formula for the norm of the image of an eigenvector $u_i$ in the space $\mathrm{span}\{A^{p+1}X_1R^{-1} - A^pX_1R^{-1}A_1\}$ from Proposition 5. If $\lambda_i$ has large magnitude, then the gap $\lambda_i^{p+1} - \lambda_i^p$ will be large compared to all other gaps and the difference (3.29) will be large due to large $\lambda_i^{p+1}u_i^TX_1R^{-1}$. If $\lambda_i$ is small, then the left-hand term $\lambda_i^{p+1}u_i^TX_1R^{-1}$ of $\lambda_i^{p+1}u_i^TX_1R^{-1} - \lambda_i^pu_i^TX_1R^{-1}A_1$ will be small, but the right-hand term will be larger due to $A_1$ and the difference will still be large. Thus, leading eigenvalues and trailing eigenvalues both may converge in the space $\mathrm{span}\{A^{p+1}X_1R^{-1} - A^pX_1R^{-1}A_1\}$, and the degree of convergence of trailing eigenvectors depends on how much larger the leading gaps $\lambda_i^{p+1}u_i^TX_1R^{-1} - \lambda_i^pu_i^TX_1R^{-1}A_1$ are compared to the trailing ones. Figure 3.4 shows the values of (3.29) for $A$ and $B$. Figure 3.5 shows the values of (3.29) scaled by eigenvalues, to suggest the low-rank approximation norms of the two matrices. The values translate directly into the errors of low-rank approximation norms. When leading eigenvectors converge to the exclusion of trailing eigenvectors, as is the case with $B$, then 4 block Lanczos iterations will produce a larger low-rank approximation norm than random projections. If trailing eigenvectors converge at least as fast as the leading eigenvectors, as is the case with $A$, then random projections produce a better low-rank approximation.

Figure 3.4: Norms of eigenvectors projected against the space in (3.29) for $A$ (left) and $B$ (right). Leading eigenvalues with large gaps can overwhelm convergence of smaller eigenvalues for $B$, but the tighter clustering of eigenvalues in $A$ leads to inclusion of more trailing spectral components.


Figure 3.5: Norms of images of eigenvectors scaled by eigenvalues in $\mathrm{span}\{AX_0\}$ or $\mathrm{span}\{BX_0\}$ and $\mathrm{span}\{Q\}$ for $A$ (left) and $B$ (right). $B$ results in a better low-rank approximation norm due to convergence of only leading eigenvalues.

3.4 Stability Analysis

Thus far we have seen both the block Lanczos algorithm and the associated random projection methods, along with convergence analyses of the two. The asymptotic cost of a shrink-and-iterate approach compared to the random projection method allows for more iterations to be performed; however, this analysis neglects the costs of stabilization required to maintain orthogonality of the Lanczos basis as iteration progresses. The costs of stability, along with the complications they present to the algorithmic simplicity, are a motivation noted for the development of the random projection methods in [36]. The loss of orthogonality witnessed in the Lanczos algorithm is due to round-off error rather than an inherent instability in the Lanczos recurrence. We also note that there are two sources of loss of orthogonality in the Lanczos method: not only can converged eigenvectors re-enter the Lanczos basis, but it is also possible that even the adjacent Lanczos blocks $Q_j$ and $Q_{j+1}$ may not be orthogonal to machine precision due to round-off error. The extra stabilization steps added to Algorithm 4 are intended to address the latter of the two sources of error. Regarding the former source of error, intuitively, we would expect a short Krylov sequence to experience less problematic losses of orthogonality than a longer one. Therefore we may expect the shrink-and-iterate approach to encounter only a limited degree of orthogonality loss if only a few iterations are performed. In such cases, no stabilization would be required to generate an orthonormal Lanczos basis, and the classic block Lanczos method as presented in Algorithm 2 may be used as-is.

These intuitions are indeed borne out by theory. We begin by characterizing the loss of orthogonality over iterations in general, and then focus on the case for adjacent $Q_j$ and $Q_{j+1}$. Bounds on the loss of orthogonality are presented in [73]. These error analyses are for the single-vector Lanczos algorithm, but may readily be generalized to the block Lanczos algorithm. The results are summarized in our presentation and proof of the following theorem, which is a generalization of the proof in [73] to the block Lanczos method.

Theorem 9. Let $A$ be a Hermitian matrix. Let $Q$ be the Lanczos basis generated by $i + 1$ steps of the block Lanczos routine with block size $b$ in Algorithm 2. Let $W$ be the strictly upper-triangular part of $Q^TQ$ ($W$ is zero on and below the diagonal) and $w_j = [w_{(j-1)b+1}\ w_{(j-1)b+2}\ \cdots\ w_{(j-1)b+b}]$ the $j$-th block of columns of $W$, with $w_0 = 0$ and single-indexed $w_i$ representing columns of $W$. Then

(3.39)  $\|w_{j+1}\| \le \left(2\|A\| \max\{\|w_j\|, \|w_{j-1}\|\} + O(\epsilon\|A\|)\right) \|B_{j+1}^{-1}\|$

where $\|\cdot\|$ is the spectral norm and $\epsilon$ is machine epsilon.

Proof. From the Lanczos recurrence, we have

(3.40)  $Q_{j+1}B_{j+1} = AQ_j - Q_jA_j - Q_{j-1}B_j + F_j$

where $F_j$ is an error term to account for round-off imprecision. Premultiplying both sides by the entire Lanczos basis gives

(3.41)  $Q^TQ_{j+1}B_{j+1} = Q^TAQ_j - Q^TQ_jA_j - Q^TQ_{j-1}B_j + Q^TF_j.$

Noting the matrix form of the whole Lanczos recurrence, $AQ = QT + Q_{j+1}B_{j+1}E_j^T + F$, where $E_j$ is the "element block" $E_j = [e_{(j-1)b+1}\ e_{(j-1)b+2}\ \cdots\ e_{(j-1)b+b}]$ and $F$ is an error term for round-off, we may obtain

(3.42)  $Q^TQ_{j+1} = (TQ^TQ_j + E_jB_{j+1}^TQ_{j+1}^TQ_j + F^TQ_j - Q^TQ_jA_j - Q^TQ_{j-1}B_j + Q^TF_j)B_{j+1}^{-1}$

by substitution and back-multiplication with $B_{j+1}^{-1}$. Let $w_{j+1}$ be the $(j+1)$-th block of $Q^TQ_{j+1}$. Then by substitution we have

(3.43)  $w_{j+1} = (Tw_j - w_jA_j - w_{j-1}B_j + E_jB_{j+1}^TQ_{j+1}^TQ_j + F^TQ_j + Q^TF_j)B_{j+1}^{-1}.$

We apply the matrix norm inequalities $\|X + Y\| \le \|X\| + \|Y\|$ and $\|XY\| \le \|X\|\,\|Y\|$ to obtain a bound on $\|w_{j+1}\|$ in terms of the norms of the individual terms in (3.43).

Finally, we have the claim from [58] that

(3.45)  $\|F_j\| \le \epsilon\|A\|$

which allows us to bound $O(\epsilon\|A\|) = \|E_jB_{j+1}^TQ_{j+1}^TQ_j + F^TQ_j + Q^TF_j\|$. As $\|T\| + \|A_j\| \le 2\|A\|$ and $\|B_j\| \le \|A\|$, we have

(3.46)  $\|w_{j+1}\| \le \left(2\|A\| \max\{\|w_j\|, \|w_{j-1}\|\} + O(\epsilon\|A\|)\right) \|B_{j+1}^{-1}\|.$

□

The  preceding  round-off  error  analysis  also  lends  insight  into  the  mechanism  through  which 
Algorithm  4  reduces  orthogonality  loss  in  the  Lanczos  basis,  and  leads  to  the  following  proposition. 

Proposition 6. Let $A$ be a Hermitian input matrix, and $Q_j$ be the Lanczos blocks produced by $k$ iterations of the block Lanczos method in Algorithm 2. Then the orthogonality between adjacent Lanczos blocks is given by

(3.47)  $Q_j^TQ_{j+1} = Q_j^TF_jB_{j+1}^{-1}.$

This proposition is a result of observing that $Q_j^TAQ_j - Q_j^TQ_jA_j - Q_j^TQ_{j-1}B_j = 0$ in infinitely-precise arithmetic, and premultiplying (3.40) by $Q_j^T$:

(3.48)  $Q_j^TQ_{j+1}B_{j+1} = Q_j^TAQ_j - Q_j^TQ_jA_j - Q_j^TQ_{j-1}B_j + Q_j^TF_j$

which simplifies to (3.47).

Refinement is intended to reduce $\|Q_j^TF_jB_{j+1}^{-1}\|_F$. The extra stabilization step does not prevent converged eigenvectors from entering the Lanczos basis; rather, it reduces the norm of $Q_j^TQ_{j+1}$ to no more than

3.4.1 Discussion

The resulting bounds on error are very similar to those obtained in [73] and reduce to those bounds when the block size is 1. For single-vector Lanczos, loss of orthogonality is governed by both the largest singular value of $A$ and the smallness of $B_{j+1}$, which is a scalar value in the single-vector case. The error may grow faster in the block Lanczos routine than in single-vector Lanczos due to the effects of $\|B_{j+1}^{-1}\|$; it is sufficient that only the smallest singular value of $B_{j+1}$ be small. The Frobenius norm $\|B_{j+1}\|_F$ may be large, but if the linear independence of the next Lanczos block $AQ_j - Q_jA_j - Q_{j-1}B_j$ is nearly lost, then $B_{j+1}$ will have a small infimal singular value. These are the cases for which deflation of the Krylov subspace is necessary. However, we may assume that in practice $\sigma_{\inf}$ of $B_{j+1}$ is not negligibly small for small $j$. This is the analog of an assumption made in [73], and is almost always true in practice, especially when the start block for the Krylov subspace is random. We also remark that the loss of orthogonality for the special case of $\|w_1\|$ is simply $O(\epsilon)$, as the Lanczos algorithm is equivalent to a full Gram-Schmidt algorithm for $i = 1$.

The immediate result of (3.39) and our assumption that $\|B_{j+1}\|$ is not too small is that we may always generate a Lanczos basis for $\mathcal{K}_i(A, X_0)$ without any appreciable loss of orthogonality for small $i$. As our compute time constraints provide only for $i \le 4$, we can safely assume that in practice there will be limited loss of orthogonality beyond $O(\epsilon\|A\|)$. It is also evident that (3.39) does not distinguish between the classic block Lanczos method in Algorithm 2 and the block Lanczos method with refinement in Algorithm 4.
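The claim that short Krylov sequences suffer little orthogonality loss can be probed with a small experiment. The sketch below is illustrative only: it uses the single-vector Lanczos recurrence without any reorthogonalization (block size 1 rather than the block method of Algorithm 2), an assumed diagonal test matrix with eigenvalues 1 through 100, and a seeded random start vector; the helpers `lanczos_basis` and `orthogonality_loss` are hypothetical names.

```python
import math
import random

def lanczos_basis(diag, q, steps):
    """Plain three-term Lanczos on A = diag(diag); no reorthogonalization."""
    n = len(diag)
    norm = math.sqrt(sum(v * v for v in q))
    q = [v / norm for v in q]
    basis, q_prev, beta = [q], [0.0] * n, 0.0
    for _ in range(steps - 1):
        w = [diag[i] * q[i] - beta * q_prev[i] for i in range(n)]
        alpha = sum(w[i] * q[i] for i in range(n))
        w = [w[i] - alpha * q[i] for i in range(n)]
        beta = math.sqrt(sum(v * v for v in w))
        q_prev, q = q, [v / beta for v in w]
        basis.append(q)
    return basis

def orthogonality_loss(basis):
    """Largest |q_i^T q_j| over distinct pairs of basis vectors."""
    return max(abs(sum(a * b for a, b in zip(u, v)))
               for i, u in enumerate(basis) for v in basis[i + 1:])

random.seed(0)
n = 100
diag = [float(i + 1) for i in range(n)]       # assumed test spectrum 1, 2, ..., 100
x0 = [random.gauss(0.0, 1.0) for _ in range(n)]

short_loss = orthogonality_loss(lanczos_basis(diag, x0, 4))
long_loss = orthogonality_loss(lanczos_basis(diag, x0, 60))
print("loss after 4 steps: ", short_loss)
print("loss after 60 steps:", long_loss)
```

After four steps the basis should remain orthogonal to near machine precision, while the longer run typically shows a much larger departure once Ritz values begin to converge; the exact figures depend on the spectrum and start vector chosen.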

3.5  Conclusion 

Random projections have been shown to produce good low-rank approximations without the loss of orthogonality problems that beset the Lanczos algorithms for generating Krylov subspaces for Hermitian inputs. Random projection methods may be interpreted as pseudo-block Krylov subspace methods with a large start block and no true iterations, and the loss of orthogonality in the Lanczos algorithm is proportional to the number of iterations. These two observations suggest that block Krylov subspaces with a large block and only a few iterations may produce low-rank approximations with error at least as small as the random projection method and with good stability without use of reorthogonalization. As compute time costs due to dense linear algebra operations grow quadratically with the number of dimensions used in the random projection method, shrinking the block size and increasing the number of iterations may produce a low-rank approximation with smaller error than the random projection method but still in less time. This advantage is most pronounced when the input matrix is sparse compared to its dimension. Additions to the block Lanczos method improve stability without the need for reorthogonalization, and hybrid random projection-Krylov subspace methods may capture the good eigenvalue approximation properties and stability of the random projection method and the better computational complexity of the block Lanczos method.

The error of a low-rank approximation generated by either the random projection method or the block Lanczos method depends on the spectrum of the input matrix $A$ or $AA^T$. Naturally, the spectrum of the input matrix also determines which of the two approaches produces a better approximation. When the low-rank approximation problem requires eigenpairs or singular triplets with large magnitude, exclusion of the trailing eigenpairs or singular triplets is important. However, we have shown that spectra for which the random projection method is able to successfully exclude trailing spectral components are also spectra for which the hybrid random projection-Krylov subspace method will exclude trailing spectral components. This suggests that input matrices with large, well-separated leading eigenpairs or singular triplets and small, clustered trailing singular triplets are matrices for which the hybrid and shrink-and-iterate approaches will produce smaller low-rank approximation error than the random projection method.

The expressions which give the squared Frobenius norm of the low-rank approximation error
are admittedly not as elegant as the asymptotic bounds due to Kaniel, Paige, Saad or Halko et al.
Further simplification of the bounds may yield simpler expressions that further clarify the role
of the input spectrum. Nevertheless, the observations in the following chapter confirm the insight
developed in our discussion comparing the low-rank approximation norms generated by the random
projection method and the hybrid method. A useful future effort would further characterize the
asymptotic nature of eigenvalue convergence in the hybrid method using bounds similar to
Theorem 3; this would allow for comparison of worst-case eigenvalue convergence of the two methods.
Chapter 4


Applications of short block Krylov subspaces


Block Krylov subspaces and random projections have certain similarities. Random projection
methods may be considered as quasi-Krylov methods. Though they exhibit good approximation and
stability properties, we have shown that the complexity of the random projection method is quadratic
in the number of dimensions returned. The relationship between block Krylov subspaces and random
projections may effect a faster low-rank approximation; by shrinking the block size and performing a
limited number of iterations, a block Krylov subspace can produce a low-rank approximation faster
than the random projection method, or produce a low-rank approximation with smaller error in
equivalent time. It is possible that the Krylov subspace approximation may both have smaller error
and be faster to compute than the random projection approximation. Performing only a limited number
of iterations may also practically eliminate the need for reorthogonalization in the Krylov subspace
generation. The preceding chapter dealt with the theoretical comparison between random projections
and the shrink-and-iterate Krylov subspace approach, but implementation details will dictate the
realized stability, approximation errors and compute times. In particular, hardware features and
programming aspects can have a substantial impact on compute times.
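The trade-off described above can be illustrated with a small, self-contained sketch. It is not the implementation used in the experiments: the test matrix, the helper names, the choice to give both bases the same total number of columns, and the use of Gram-Schmidt on the raw Krylov blocks (instead of block Lanczos) are all assumptions made for illustration. The captured squared Frobenius norm $\|Q^TA\|_F^2$ serves as a stand-in for the norm of the low-rank approximation.

```python
import math
import random


def matmul(A, X):
    """Dense product A @ X for matrices stored as lists of rows."""
    return [[sum(a * x for a, x in zip(row, col)) for col in zip(*X)] for row in A]


def columns_of(X):
    """Return the columns of a row-major matrix as lists."""
    return [list(col) for col in zip(*X)]


def orthonormalize(columns, tol=1e-12):
    """Two-pass modified Gram-Schmidt; numerically dependent columns are dropped."""
    basis = []
    for v in columns:
        w = list(v)
        for _ in range(2):  # the second pass guards against cancellation
            for q in basis:
                c = sum(qi * wi for qi, wi in zip(q, w))
                w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def random_block(n, b, rng):
    """n-by-b block with independent standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(b)] for _ in range(n)]


def random_projection_basis(A, k, rng):
    """Orthonormal basis of span{A * Omega} for a random n-by-k block Omega."""
    return orthonormalize(columns_of(matmul(A, random_block(len(A), k, rng))))


def block_krylov_basis(A, block_size, steps, rng):
    """Orthonormal basis of span{X0, A X0, ..., A^(steps-1) X0} (no Lanczos recurrence)."""
    X = random_block(len(A), block_size, rng)
    columns = []
    for _ in range(steps):
        columns.extend(columns_of(X))
        X = matmul(A, X)
    return orthonormalize(columns)


def captured_norm_sq(A, basis):
    """Squared Frobenius norm of Q^T A, the part of A captured by the basis."""
    cols = columns_of(A)
    return sum(sum(qi * ci for qi, ci in zip(q, col)) ** 2 for q in basis for col in cols)


# Hypothetical test matrix: diagonal, two large eigenvalues and a decaying tail.
n = 30
spectrum = [10.0, 8.0] + [1.0 / (i + 1) for i in range(n - 2)]
A = [[spectrum[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
total = sum(s * s for s in spectrum)

rng = random.Random(0)
Q_rp = random_projection_basis(A, 6, rng)   # one block of 6 columns
Q_kr = block_krylov_basis(A, 2, 3, rng)     # block of 2 columns, 3 Krylov steps

print("random projection captures", captured_norm_sq(A, Q_rp) / total)
print("short block Krylov captures", captured_norm_sq(A, Q_kr) / total)
```

Which basis captures more of the norm depends on the spectrum and on the random start block, which is the point the chapter's experiments examine on real data.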

To show the approximation performance, numerical stability and compute costs of random
projections and small numbers of block Lanczos iterations, we performed numerical experiments.
These experiments are intended to highlight the convergence differences between random projections
and block Lanczos iteration for a variety of practical applications of low-rank matrix approximation.
We present low-rank approximation applied to image compression for the facial recognition problem,
applied to dimension reduction for information retrieval, applied to graph visualization and analysis,
and applied to mechanical engineering problems. Overall, the results confirm the expectations developed in
the previous chapter. When the leading singular values or eigenvalues that define the optimal low-rank
approximation are large and well-separated, we have observed that both the shrink-and-iterate and
hybrid approaches produce low-rank approximations with larger norm than the random projection
method. FLOP analyses also reflect the complexity of the two algorithms; the shrink-and-iterate
approach with s = 2 requires FLOPS comparable to the random projection method. The addition of
the refinement step in Algorithm 4 adds extra overhead, but the hybrid approach was always faster
than random projections. Fewer iterations translate to less loss of orthogonality in block Lanczos
iteration,  but  loss  of  orthogonality  was  never  larger  than  10“®  for  classic  block  Lanczos.  The  addition 
of a refinement step improved orthogonality by roughly an order of magnitude. Finally, the hybrid
approach  always  had  equivalent  stability  to  random  projection. 

These problems were run on one of two computers, depending on memory requirements.
Problems with smaller memory footprints were run on an Apple MacPro workstation running MacOS
10.6.8, with dual quad-core Intel Xeon processors running at 3 GHz and 8 GB of memory. Software
was compiled with GCC 4.2.1 and linked against the native Apple vecLib BLAS and LAPACK.
SuperLU 4.3 was used for sparse factorization. Problems with larger memory footprints were run on
a Newisys Linux server running RedHat Enterprise 2.6, with 8 dual-core AMD Opteron processors
running at 1 GHz and 30 GB of memory. Software was compiled with GCC 4.4.6 and linked against
Boost uBLAS 1.47. Boost uBLAS was only used for sparse matrix-vector products. Other linear algebra
operations were linked against ATLAS version 3.8.3.

4.1  Experiments  with  the  Yale  face  data 

The Eigenfaces method [80] is a popular method that uses the truncated eigenvalue decomposition
to both reduce dimensionality and clean data. Given a set of images, one may flatten each one into
a vector $a_j$ and assemble them into a matrix. After centering by subtraction of the average image
$\mu = \sum_j a_j/m$, the resulting matrix $A - \mu = U\Sigma V^T$ may be diagonalized. The left singular vectors give
the “eigenfaces,” and the right singular vectors give the coordinates of each image in the reduced
dimension space.

Figure 4.1: Leading 100 eigenvalues of $A^TA$ for the Yale eigenface data (left) and eigenvalue gaps (right).
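The flattening and centering step of the Eigenfaces method can be sketched as follows. This is a minimal illustration with hypothetical helper names and tiny made-up "images"; it assumes that images are stored as columns of the matrix and that the mean is taken over the images.

```python
def flatten(image):
    """Flatten a 2-D image (list of rows) into a single vector."""
    return [pixel for row in image for pixel in row]


def center_columns(images):
    """Return (A_centered, mu).

    A_centered has one row per pixel and one column per image; mu is the
    average image, subtracted from every column.
    """
    vectors = [flatten(img) for img in images]
    m = len(vectors)
    mu = [sum(values) / m for values in zip(*vectors)]
    centered = [[v[i] - mu[i] for v in vectors] for i in range(len(mu))]
    return centered, mu


# Three made-up 2-by-2 images.
images = [
    [[0, 2], [4, 6]],
    [[2, 4], [6, 8]],
    [[4, 6], [8, 10]],
]
A_centered, mu = center_columns(images)
print(mu)
print(A_centered)
```

After centering, every row of the matrix sums to zero, so the singular vectors describe variation around the average image only.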

Chen and Saad studied the application of Krylov subspaces to the eigenface problem in [15] by
approximating the SVD reduction. They noted that results of comparable quality may be realized with
a Krylov subspace approximation of the SVD; this approximation yields substantial computational
savings over the truncated SVD.

Data for these applications tends to be “almost low-rank,” like many PCA applications; the
spectrum of the centered matrix $A - \mu$ has a few leading singular values that are large, with singular
values becoming more clustered towards the interior of the spectrum. We used the Yale extended
face database B, with centered images [29]. Figure 4.1 shows the spectrum of $(A - \mu)(A - \mu)^T$. One
particular advantage of this data set is that there are sufficiently few images that the entire
spectrum of the matrix $B = (A - \mu)(A - \mu)^T$ may be computed directly.

The  large  local  gaps  cause  the  leading  two  eigenvalues  of  A  to  converge  rapidly.  Therefore,  a 
point  of  diminishing  returns  on  added  dimensions  is  quickly  reached,  even  with  the  truncated  SVD. 
Since these two eigenvalues also converge rapidly in a Krylov subspace, the Krylov subspace
approximations are close to the SVD approximation.

The substantial gaps between the leading two eigenvalues of the Yale face data and the rest of
the spectrum imply good convergence of these two eigenvalues even without power
refinement in the random projection method or a large value of p in the hybrid algorithm.
Table 4.1 shows the precomputed lower bounds for approximations in $\mathcal{K}_3(A, X_0)$ and $\mathcal{K}_4(A, X_0)$ along with
dimension                               4           8           16          32          64
random projections                      2.2275e+11  2.3864e+11  2.5255e+11  2.6538e+11  2.7589e+11
$\mathcal{K}_3(A, X_0)$ lower bounds    2.2069e+11  2.3911e+11  2.5335e+11  2.6542e+11  2.7591e+11
$\mathcal{K}_4(A, X_0)$ lower bounds    2.2903e+11  2.4387e+11  2.5649e+11  2.6849e+11  2.7815e+11

Table 4.1: Computed values for random projection and block Lanczos lower bounds for $\mathcal{K}_3(A, X_0)$ and
$\mathcal{K}_4(A, X_0)$ for the Yale eigenface data.

the precomputed actual value for approximations in $\mathrm{span}\{A\Omega\}$ for the random projection methods. The random
projection methods will not produce worse low-rank approximations than $\mathcal{K}_3(A, X_0)$ for any
dimension, but will never outperform $\mathcal{K}_4(A, X_0)$ for any dimension. The actual values produced are shown
in Figure 4.2. Figure 4.2 also shows the low-rank approximation errors for the hybrid approach in
Algorithm 5. There is not a substantial difference between the approximations in $\mathcal{K}_3(A, X_0)$ and $\mathcal{K}_2(A, AX_0)$.
There is a larger difference between the approximations restricted to $\mathcal{K}_4(A, X_0)$ and to $\mathcal{K}_2(A, A^2X_0)$;
however, both of these approximations had larger Frobenius
norms than the approximation restricted to $\mathrm{span}\{A\Omega\}$.

The large local gaps cause the leading two eigenvalues of A to converge rapidly; these
eigenvalues also represent more than 70% of the squared Frobenius norm of A. Figure 4.2 also shows the
low-rank errors for the hybrid method with p = 2, using both classic block Lanczos and block
Lanczos with refinement. There is not a substantial difference between approximations in $\mathcal{K}_3(A, X_0)$
and $\mathcal{K}_2(A, AX_0)$. Both approximations in $\mathcal{K}_4(A, X_0)$ and $\mathcal{K}_2(A, A^2X_0)$ had larger Frobenius norms
than the approximation in $\mathrm{span}\{A\Omega\}$. Compared against single-vector Lanczos iteration, the random
projection method produces low-rank approximations that are close in norm. Random projections
produce approximations with smaller error for small dimensions.

The large leading eigenvalues also imply that $B_2$ from the block Lanczos algorithm has a large
spectral norm. Therefore we expect to see a non-trivial value for the norm of the initial error term $F_j$
in (3.40), and the block Lanczos iteration with refinement will be much more stable than classic block
Lanczos even for the first iteration. As predicted, each iteration compounds the error accumulated in
previous iterations by at least some multiple of $\|A\|$. We expect to see non-trivial stability
improvements in block Lanczos with modified orthogonalization over classic block Lanczos for $\mathcal{K}_3(A, X_0)$
and $\mathcal{K}_4(A, X_0)$. Figure 4.3 shows the loss of orthogonality for random projections, shrink-and-iterate
Figure 4.2: Low-rank approximation errors of Yale face data; random projections versus classic Lanczos (top
left) and random projections versus the hybrid algorithm with p = 2 (top right). Squared Frobenius norm of
random projections and block Lanczos iteration compared against single-vector Lanczos (bottom).


Figure  4.3:  Maximum  loss  of  orthogonality  in  projection  basis  using  Yale  face  data;  random  projections  versus 
Lanczos  (left)  and  random  projections  versus  the  hybrid  approach  with  refined  block  Lanczos  (right).  Top  row 
figures are for classic block Lanczos, and bottom row figures are for block Lanczos with refined orthogonalization.

using classic block Lanczos and block Lanczos with modified orthogonalization for dimensions up to
64. We measure the maximum absolute value of the cosine of the angle between any two basis
vectors; this is a more conservative measure than the Frobenius norm used in (3.39). We remark that
classic block Lanczos with initial power iteration loses substantial orthogonality even for
however, the hybrid method using block Lanczos with modified orthogonalization is as stable as the
random projection method when only 2 Lanczos iterations are performed. That the spectral norm
$\|A\|$ is large implies that round-off error may render adjacent Lanczos blocks non-orthogonal, even in
just the first iteration. This loss of orthogonality is due not to convergence of an eigenvector but to
roundoff error rendering $Q_j$ not orthogonal to $AQ_j - Q_jA_j - Q_{j-1}B_{j-1}^T$. Figure 4.4 shows the Frobenius
norm of $Q_j^TQ_{j+1}$ for the first 10 iterations of the block Lanczos method run on $A^TA$ from the Yale
Figure 4.4: Frobenius norm of $Q_j^TQ_{j+1}$ for 10 iterations of block Lanczos on $A^TA$ from the Yale data using a
block size of 10.

eigenface  data  using  a  block  size  of  10.  It  is  evident  that  stabilization  does  indeed  eliminate  this 
error  up  to  iteration  5,  but  cannot  prevent  loss  of  orthogonality  due  to  convergence  of  eigenvectors  at 
iteration  5. 
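The two orthogonality diagnostics used in this section, the maximum absolute cosine between any two basis vectors and the Frobenius norm of $Q_j^TQ_{j+1}$ for adjacent blocks, can be sketched as below. The function names and the toy vectors are hypothetical; vectors are stored as plain Python lists.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def max_abs_cosine(basis):
    """Largest |cos(angle)| over all distinct pairs of basis vectors."""
    worst = 0.0
    for i in range(len(basis)):
        for j in range(i + 1, len(basis)):
            c = abs(dot(basis[i], basis[j])) / (norm(basis[i]) * norm(basis[j]))
            worst = max(worst, c)
    return worst


def cross_block_frobenius(block_a, block_b):
    """Frobenius norm of Q_a^T Q_b for two blocks given as lists of column vectors."""
    return math.sqrt(sum(dot(u, v) ** 2 for u in block_a for v in block_b))


# A perfectly orthogonal basis and a deliberately skewed one.
orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
skewed = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]]

print(max_abs_cosine(orthogonal))
print(max_abs_cosine(skewed))
print(cross_block_frobenius(orthogonal[:2], [[0.0, 0.0, 1.0], [0.6, 0.8, 0.0]]))
```

For a basis that is orthonormal to working precision both quantities sit near the unit roundoff, so values several orders of magnitude larger signal a real loss of orthogonality.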

4.2  Low-rank  approximation  in  Information  Retrieval 

Information retrieval problems use a truncated SVD as part of Latent Semantic Indexing (LSI).
Documents are represented as vectors in a “term space”: dimensions of the vector space correspond
to unique terms in the corpus. If the vector $x \in (\mathbb{N} \cup \{0\})^n$ is a document vector, then the entry
$x_i$ is the number of times term $i$ occurs in the document. The typical application of LSI involves
assembling all document vectors into a matrix A and applying a scaling factor to all terms and
documents to amplify the importance of infrequent terms and diminish the importance of frequent
ones. The truncated SVD of A gives the “latent space” that separates noise from salient features of
the corpus. As a result, low-rank approximation of the term-document matrix may improve query
results by filtering noise. LSI is useful both for corpus analysis and for dimension reduction for
querying.
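The construction of a term-document matrix and its rescaling can be sketched as follows. The weighting shown, raw term frequency multiplied by $\log(N/\mathrm{df})$, is one common choice and is an assumption here; the exact scaling used in the experiments is not specified in this section. The toy documents and helper names are hypothetical.

```python
import math
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix) with one row per term and one column per document."""
    vocabulary = sorted({term for doc in documents for term in doc})
    columns = []
    for doc in documents:
        counts = Counter(doc)
        columns.append([counts.get(term, 0) for term in vocabulary])
    matrix = [[col[i] for col in columns] for i in range(len(vocabulary))]
    return vocabulary, matrix


def tfidf(matrix):
    """Scale each term row by log(N / df), where df counts documents containing the term."""
    n_docs = len(matrix[0])
    scaled = []
    for row in matrix:
        df = sum(1 for count in row if count > 0)
        idf = math.log(n_docs / df)
        scaled.append([count * idf for count in row])
    return scaled


documents = [
    ["krylov", "lanczos", "krylov"],
    ["lanczos", "random"],
    ["lanczos", "projection", "random"],
]
vocabulary, counts = term_document_matrix(documents)
weighted = tfidf(counts)
print(vocabulary)
print(counts)
print(weighted)
```

A term that occurs in every document (here `lanczos`) receives weight zero, while a term confined to one document is amplified, which is the effect the scaling is meant to achieve.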

To study the approximation of LSI using block Krylov subspace projections and random
projections, we study two problems: one using a small term-document matrix from the canonical NPL
querying benchmark [78], and one using a term-document matrix obtained from the UCI Machine Learning
Figure 4.5: Leading 100 eigenvalues of $A^TA$ for the Enron email corpus (left) and eigenvalue gaps (right).


dimension                               4           8           16          32          64
random projections                      6.2838e+00  1.1118e+01  2.0028e+01  3.5977e+01  6.2632e+01
$\mathcal{K}_3(A, X_0)$ lower bounds    6.6528e+00  1.1914e+01  2.2057e+01  3.7200e+01  6.3089e+01
$\mathcal{K}_4(A, X_0)$ lower bounds    8.3926e+00  1.5099e+01  2.6700e+01  4.4466e+01  7.3493e+01

Table 4.2: Computed values for random projection and block Lanczos lower bounds for $\mathcal{K}_3(A, X_0)$ and
$\mathcal{K}_4(A, X_0)$ for the Enron email corpus term-document matrix.


Repository’s Bag of Words data set [26]. These are intended to show two distinct sides of low-rank
approximation as applied to information retrieval problems. The UCI data is large and intended to
show the compute costs of low-rank matrix approximation. The NPL experiments are intended
to show the consequences of low-rank approximation when matrix norm is only loosely coupled with
application performance.

The NPL collection is a widely cited benchmark for query processing systems; it consists of abstracts from
physics technical reports and 100 queries with relevance scores. The term-document matrix measures
7,491 × 11,429 with 2,228,087 nonzero entries. The Bag of Words data collection contains three
distinct corpora; we used the Enron email corpus. The data does not include any queries for
evaluation. The resulting matrix is 39,861 × 28,099 with 3,710,420 nonzero entries. We rescaled the
NPL data with term frequency and inverse document frequency. The spectrum of the matrix $A^TA$ for the
Enron corpus is shown in
Figure 4.5.

Figure  4.6  shows  the  norm  of  the  low-rank  approximations  to  A  from  the  Enron  corpus  at  various 
Figure 4.6: Low-rank approximation errors of Enron email corpus; random projections versus classic Lanczos
(top left) and random projections versus the hybrid random projection-Krylov subspace method with p = 2
(top right). Squared Frobenius norm of random projections and block Lanczos iteration compared against
single-vector Lanczos (bottom).


dimensions. Both classic block Lanczos and the hybrid method produce smaller low-rank
approximation errors than random projections. The hybrid method does not approximate the SVD as well as in
the Yale experiment; this is likely due to the smaller gaps in the leading part of the spectrum. Block
Lanczos iteration is more stable with the Enron email corpus term-document matrix than with the
Yale eigenface matrix. This better stability may be explained in two ways. First, the smaller
spectral norm of the Enron email corpus implies the initial error term $F_j$ in (3.40) will be smaller.
Alternatively, the smaller leading eigenvalue gaps imply slower convergence of the leading eigenvalues.
A direct consequence of this slower convergence is that Lanczos iteration is more stable, as shown in
the previous chapter. Figure 4.7 contains the maximum cosine of the angle between any pair of Lanczos
vectors, which measures the loss of orthogonality. Compared with the loss of orthogonality witnessed


[Figure 4.7 panels: loss of orthogonality versus subspace dimension for $\mathcal{K}_3(A, X_0)$, $\mathcal{K}_4(A, X_0)$, $\mathcal{K}_2(A, AX_0)$, $\mathcal{K}_2(A, A^2X_0)$ and random projections, without and with stabilization.]

Figure  4.7:  Maximum  loss  of  orthogonality  in  projection  basis  for  Enron  email  corpus;  random  projections 
versus  Lanczos  (left)  and  random  projections  versus  the  hybrid  method  using  refined  block  Lanczos  (right). 
Top  row  figures  are  for  classic  block  Lanczos,  and  bottom  row  figures  are  for  block  Lanczos  with  modified 
orthogonalization. 
Figure 4.8: FLOP counts for producing low-rank approximations of the Enron email corpus. From left to right
the plots show FLOPS for $\mathcal{K}_3(A, X_0)$, $\mathcal{K}_4(A, X_0)$, $\mathcal{K}_2(A, A^2X_0)$, and random projections. FLOPS for sparse
matrix operations are considered separately from FLOPS for dense linear algebra operations.


in the Yale eigenface experiment in Figure 4.3, block Lanczos loses less orthogonality when applied
to the Enron email corpus. The maximum vector angle cosine is two orders of magnitude smaller for
the Enron email corpus than for the Yale eigenface data.

The Enron email corpus is a sufficiently large problem to produce meaningful FLOP comparisons.
Based on the preceding theoretical discussion, shrinking the block size by 50% should be expected to
reduce the compute time compared to random projections by a factor of four per iteration. Three
classic Lanczos iterations should be expected to be 50% faster than random projections, and four classic
Lanczos iterations should be expected to be equivalent to random projections. The hybrid method
requires only 2 block Lanczos iterations, but will require an extra QR factorization between the
initial power iteration and the block Lanczos step. Figure 4.8 shows the FLOP counts for producing a
low-rank approximation of the Enron email corpus. As floating-point operations due to dense matrix
operations and sparse matrix operations may exhibit different levels of performance, we distinguish
between them in our FLOP analysis. All methods exhibit similar growth of FLOP requirements as
the dimension of the computed subspace increases. A majority of the expense of all algorithms is
sparse matrix operations; the total number of nonzeros in $A^TA$ is 7,420,840, while the subspace
dimension is 28,099. The ratio between the FLOP counts shows the expected speedup between the
different methods. Figure 4.9 shows the ratios between the total FLOP counts. For all dimensions,
the hybrid approach and 4 block Lanczos iterations with a 50% smaller block are roughly no more
expensive than random projections, but 3 block Lanczos iterations are roughly 75% of the cost of
random projections.
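The factor-of-four argument can be checked with a deliberately simplified cost model. The sketch below assumes that the dense cost of one iteration is proportional to the square of the block size and ignores sparse products, the extra QR factorization of the hybrid method, and any refinement step; the function names and the block size of 64 are hypothetical.

```python
def dense_cost(block_size, iterations=1):
    """Relative dense cost, assuming each iteration costs block_size**2."""
    return iterations * block_size ** 2


def relative_cost(shrink, iterations, block_size=64):
    """Cost of `iterations` block steps with a shrunken block, relative to one
    random projection step with the full block."""
    return dense_cost(block_size * shrink, iterations) / dense_cost(block_size)


for steps in (1, 2, 3, 4):
    print(steps, "iterations with a halved block:", relative_cost(0.5, steps))
```

Under this model four iterations with a halved block break even with random projections and three iterations cost 75% as much, which is in line with the measured total FLOP ratios reported above.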

As  the  asymptotic  discussion  predicts,  four  iterations  of  block  Lanczos  are  roughly  equivalent  in 
terms  of  compute  time  to  the  random  projection  method.  Loss  of  orthogonality  may  render  4  block 
[Figure 4.9 legend: FLOP ratios of $\mathcal{K}_4(A, X_0)$, $\mathcal{K}_3(A, X_0)$ and $\mathcal{K}_2(A, A^2X_0)$, each relative to random projections.]


Figure  4.9:  Ratios  of  sparse  (left),  dense  (middle)  and  total  (right)  FLOPS  for  producing  a  low-rank  approxi¬ 
mation  of  the  Enron  email  corpus. 


Lanczos iterations impractical. The loss of orthogonality is reduced by the addition of a refinement
step, but at a cost: adding a refinement step increases the dense FLOPS by one-third for 4 Lanczos
iterations. However, for the Enron collection, as well as the Yale eigenface data, the difference in
norm between the hybrid approach with p = 2 and four classic block Lanczos iterations is negligible.
In these cases, using the hybrid method can offset the increased cost of the refinement step added
to Lanczos.

Compute costs and approximation quality are only one facet of the performance of low-rank
approximation as applied to information retrieval problems. As noted earlier, relevance scores in
information retrieval are not entirely dependent on Frobenius norm optimization. Therefore, we may
not necessarily expect that simply producing a better approximation of the SVD will yield improved
precision or recall for a query processing system using our low-rank approximations. To illustrate
how query processing performance changes with different SVD approximations, we applied low-rank
approximation to the NPL collection.

For query ranking, we rescaled the term-document matrix as described previously, and ranked
each document by its vector cosine against the query as projected through the latent document space.
This reflects the use of the latent space not as a dimension reducer, but as a query expander [10]. Note
that this method computes the cosine between document $a_i$ and query $q$ as

$$\cos(a_i, q) = \frac{(P^Ta_i)^TP^Tq}{\|a_i\|\,\|P^Tq\|} = \frac{a_i^T(PP^T)q}{\|a_i\|\,\|P^Tq\|} \tag{4.1}$$

for projection matrix $P$, which is equivalent to the document $a_i$ not being approximated at all.
Nevertheless, this method produced the best document-level average results. For each query, we computed
the document-level precision at 100 documents retrieved; this metric measures the number of relevant
documents returned in the first 100 search results. As the results varied based on the random
start block, we averaged the results over 10 different random start blocks. The document-level
precision is shown in Figure 4.10 with the norms of the low-rank term-document matrix approximations.
Though the shrink-and-iterate approach using $\mathcal{K}_4(A, X_0)$ produced more relevant query results than
the random projection method, the gap in relevance is smaller than the gap in Frobenius
norm. Indeed, the Krylov subspace approximation is almost equivalent in matrix norm to the SVD
approximation, but lags in query relevance. These observations are consistent with our expectations
regarding the decoupling of approximation norm and query performance.

Figure 4.10: Document-level precision at 100 documents (left); this metric reflects the relevant percentage of
the first 100 documents. The Frobenius norm error of low rank approximation of the NPL collection
term-document matrix (right).
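The ranking rule in (4.1) and the precision-at-k metric can be sketched as follows. This is an illustrative sketch only: the projection basis, documents, query and relevance judgments are made up, the helper names are hypothetical, and $P$ is assumed to have orthonormal columns stored as a list of column vectors.

```python
import math


def project(P, v):
    """Return P^T v, where P is a list of orthonormal column vectors."""
    return [sum(pi * vi for pi, vi in zip(p, v)) for p in P]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def query_cosine(P, doc, query):
    """Equation (4.1): (P^T a)^T (P^T q) / (||a|| ||P^T q||)."""
    pa, pq = project(P, doc), project(P, query)
    denom = norm(doc) * norm(pq)
    if denom == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(pa, pq)) / denom


def rank_documents(P, documents, query):
    """Document ids sorted by decreasing cosine against the projected query."""
    scores = {doc_id: query_cosine(P, vec, query) for doc_id, vec in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)


def precision_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of the first k ranked documents that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k


# Toy latent space spanned by the first two coordinate directions.
P = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
documents = {"d1": [1.0, 0.0, 0.0], "d2": [0.0, 1.0, 0.0], "d3": [0.0, 0.0, 1.0]}
query = [1.0, 0.0, 1.0]

ranking = rank_documents(P, documents, query)
print(ranking)
print(precision_at_k(ranking, {"d1", "d3"}, k=2))
```

Note that the denominator uses the norm of the original document vector, not of its projection, which is what makes the rule equivalent to leaving the document unapproximated.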


4.3 Experiments with the Colorado road network Laplacian

Low-rank approximation is applicable to various spectral graph problems. One such problem is the
computation of the commute-time embedding [70], which may be used for clustering [61, 87] or
visualization. Spectral clustering [54], Laplacian embeddings [6], spectral layout algorithms [37, 43]
and network analysis methods [53] are further applications that require eigenvectors of the graph
Laplacian. The majority of spectral graph problems use the graph Laplacian matrix L and require
the smallest eigenpairs of L rather than the largest. These eigenpairs cannot be computed with
Figure 4.11: Leading 100 eigenvalues of $L^{+}$ for the Colorado road network pseudoinverse (left) and eigenvalue
gaps (right).

Algorithm 3, as it approximates the leading eigenpairs of L. Krylov subspace methods do provide
approximations of these trailing eigenpairs, but the approximations converge slowly when the trailing
eigenpairs are tightly clustered. Unfortunately, the trailing eigenpairs of graph Laplacian
matrices are typically tightly clustered for non-trivial, large problems. Nevertheless, for certain classes of
graphs, shift-and-invert preconditioning is computationally inexpensive. Then the problem of finding
the smallest eigenpairs of the graph Laplacian matrix L is replaced by finding the largest eigenpairs
of the graph Laplacian pseudoinverse $L^{+}$. Moreover, the commute-time embedding is defined as
$\Lambda^{1/2}V^T$; the low-rank approximation error $\mathrm{tr}(L^{+}) - \mathrm{tr}(L^{(k)+})$ is indicative of the squared
Frobenius norm error of the approximation of the commute-time embedding.
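The link between the trace difference and the embedding error can be made explicit in the idealized case of an exact truncated eigendecomposition. The notation below ($V$, $\Lambda$, $V_k$, $\Lambda_k$, $X$, $X_k$) is assumed for illustration, the eigenvalues $\lambda_1 \ge \dots \ge \lambda_n \ge 0$ are those of the pseudoinverse, and any scaling of the embedding by the graph volume is omitted. For approximations computed from a Krylov or random projection basis the identity need not hold exactly.

```latex
L^{+} = V \Lambda V^T, \qquad
X = \Lambda^{1/2} V^T, \qquad
X_k = \begin{bmatrix} \Lambda_k^{1/2} V_k^T \\ 0 \end{bmatrix},
\\[4pt]
\|X - X_k\|_F^2
  = \sum_{i > k} \lambda_i \, \|v_i\|_2^2
  = \sum_{i > k} \lambda_i
  = \operatorname{tr}(L^{+}) - \operatorname{tr}\!\left(V_k \Lambda_k V_k^T\right).
```

The first equality holds because $X - X_k$ keeps only the rows $\lambda_i^{1/2} v_i^T$ with $i > k$, and the second because the eigenvectors have unit norm.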

To study the approximation of Laplacian pseudoinverses, we used the road graph network of
Colorado, built from the U.S. Census Bureau’s TIGER/Line data [1]. The directly resulting graph
contained several connected components; we extracted the largest. The graph also contained
numerous subgraphs isomorphic to the path graph; these were converted to a single edge. Two edge
weights were available for each edge; we used distance rather than travel time. Figure 4.11 shows the
spectrum of the Colorado road graph pseudoinverse. Leading eigenvalues are better separated than
for the Enron document collection. The better leading separation of eigenvalues implies rapid
convergence of the leading eigenvalues of $L^{+}$, even in the worst case predicted by the bounds in Theorem 6.
Though the first few leading gaps are large, the eigenvalue magnitudes flatten out more rapidly than
dimension                               4           8           16          32          64
random projections                      3.1809e+01  4.0320e+01  4.9737e+01  5.8750e+01  6.7062e+01
$\mathcal{K}_3(A, X_0)$ lower bounds    3.0910e+01  4.1032e+01  4.9682e+01  5.8996e+01  6.7459e+01
$\mathcal{K}_4(A, X_0)$ lower bounds    3.5660e+01  4.3742e+01  5.3111e+01  6.2135e+01  7.0121e+01

Table 4.3: Computed values for random projection and block Lanczos lower bounds for $\mathcal{K}_3(A, X_0)$ and
$\mathcal{K}_4(A, X_0)$ for the Colorado road network Laplacian pseudoinverse.

for the Yale eigenface data. Because the eigenvalues flatten out more quickly than for the Yale eigenface data,
less convergence may be driven by simply increasing the block size. We may expect random
projections to produce approximations that compare less favorably against 3 or 4 block Lanczos iterations
than the comparison in Table 4.1. Table 4.3 shows the a priori bounds for the random projection
method, and for 3 and 4 iterations of the block Lanczos method. The differences between 3 block
Lanczos iterations and random projections are less tight than for the Yale eigenface data, and the
bounds predict that the crossover point between random projections and $\mathcal{K}_3(A, X_0)$ happens sooner
than for the Yale eigenface set. The smaller local gaps lead to a larger difference between the SVD
approximation and the approximations from random projections or Krylov subspaces. Figure 4.12
shows the norms of the low-rank approximations generated by random projections and by short
block Krylov subspace projections. Overall, the differences between the approximation methods
considered in this chapter are similar between the Yale eigenface experiment and this experiment with
the Colorado road network. Random projections produce approximations that are close in error to
single-vector Krylov subspace projections, while halving the block size and performing three or four
Lanczos iterations may produce better low-rank approximations.

The  FLOP  trends  witnessed  in  the  Colorado  road  network  experiments  are  also  similar  to  those 
witnessed  in  the  Enron  email  corpus  experiments.  Classic  block  Lanczos  with  a  halved  block  size  is 
50%  faster  than  random  projections  for  3  iterations  and  roughly  equivalent  to  random  projections 
for 4 iterations. Adding a refinement step to the Lanczos algorithm also causes an increase in cost
per  iteration  that  may  be  offset  by  using  the  hybrid  approach.  Figure  4.13  shows  the  FLOP  counts 
for  producing  low-rank  approximations  of  the  inverted  Colorado  road  network  graph  Laplacian.  All 
methods  exhibit  similar  growth  of  FLOP  requirements  as  the  dimension  of  the  computed  subspace 
increases.  However,  the  ratio  between  the  FLOP  counts  shows  the  expected  speedup  between  the 

Figure 4.12: Low-rank approximation errors of Colorado road network Laplacian pseudoinverse; random
projections versus classic Lanczos (top left) and random projections versus the hybrid approach with $p = 2$ (top
right). Squared Frobenius norm of random projections and block Lanczos iteration compared against
single-vector Lanczos (bottom).


Figure 4.13: FLOP counts for producing low-rank approximations of the inverted Colorado graph network
Laplacian. From left to right the plots show FLOPS for $\mathcal{K}_3(A, X_0)$, $\mathcal{K}_4(A, X_0)$, $\mathcal{K}_2(A, A^p X_0)$, and random
projections. FLOPS for sparse matrix operations are considered separately from FLOPS for dense linear
algebra operations.

Figure 4.14: Ratios of sparse (left), dense (middle) and total (right) FLOPS for producing a low-rank
approximation to the inverted Colorado road network Laplacian. Each panel plots the FLOPS of the block
Krylov methods divided by the random projection FLOPS.


different methods. Figure 4.14 shows the ratios between the total FLOP counts. Although random
projections hold advantages for small subspace dimensions, the shrink-and-iterate and hybrid
approaches obtain FLOP parity with random projections at larger dimensions.

The  loss  of  orthogonality  for  this  matrix  was  also  similar  to  the  loss  of  orthogonality  seen  in  the 
Yale  eigenface  experiment;  this  is  shown  in  Figure  4.15.  Without  refinement,  some  non-negligible 
orthogonality  was  lost,  even  in  the  first  iteration.  This  reduces  the  effectiveness  of  shortening  the 
length  of  the  Krylov  sequence  using  p  power  iterations  in  the  hybrid  approach,  even  though  the 
Krylov  sequence  length  was  reduced  to  2.  However,  the  addition  of  the  refinement  operations  to 
Lanczos  iteration  improves  stability  by  an  order  of  magnitude  for  sequences  of  length  greater  than  2, 
and  renders  the  two  Lanczos  iterations  in  the  hybrid  approach  just  as  stable  as  random  projections. 
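
The loss of orthogonality discussed above can be measured numerically. The following is a minimal sketch, assuming the loss is taken as the largest entry of $|Q^T Q - I|$ (the text does not state the exact definition used for Figure 4.15) and using a second Gram-Schmidt pass as a stand-in for a refinement step. The function names and the test matrix are illustrative only.

```python
import numpy as np

def orthogonality_loss(Q):
    """Largest absolute entry of Q^T Q - I (zero for exactly orthonormal columns)."""
    k = Q.shape[1]
    return np.max(np.abs(Q.T @ Q - np.eye(k)))

def gram_schmidt(X, passes=1):
    """Classical Gram-Schmidt; passes=2 repeats each projection as a refinement."""
    Q = np.zeros_like(X, dtype=float)
    for j in range(X.shape[1]):
        v = X[:, j].astype(float)
        for _ in range(passes):
            # Remove the components of v along the columns already computed.
            v = v - Q[:, :j] @ (Q[:, :j].T @ v)
        Q[:, j] = v / np.linalg.norm(v)
    return Q

# Hypothetical ill-conditioned basis (condition number 1e5) to provoke round-off.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((200, 10)))
V, _ = np.linalg.qr(rng.standard_normal((10, 10)))
X = (U * np.logspace(0, -5, 10)) @ V.T

loss_single = orthogonality_loss(gram_schmidt(X, passes=1))
loss_refined = orthogonality_loss(gram_schmidt(X, passes=2))
print(f"one pass: {loss_single:.2e}, with refinement: {loss_refined:.2e}")
```

The refinement pass costs a second set of dense products per column, which mirrors the increased cost per iteration noted for refined Lanczos.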

Depending on the application of the commute-time embedding, the error of the low-rank
approximation may directly determine the quality of the result. For example, the commute-time
distance $d(u,w)$ between two vertices $u$ and $w$ in a graph is given in terms of the pseudoinverse
of the Laplacian matrix:

(4.2)  $d(u,w) = e_w^T L^{+} e_u$

where $e_u$ is the $u$-th element vector, assuming an appropriate mapping of dimensions of the
vector space containing $L$ to vertices in the graph. For the low-rank approximation of the commute-time
operator, the nuclear norm approximation error implies the error of the approximated commute-time
distance operator. We remark that there are similar distance operators in other domains, such as
the Mahalanobis distance operator used in statistics and signal processing.
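
To make the dependence of the commute-time distance on the pseudoinverse concrete, the sketch below evaluates it on a three-vertex path graph, once with the full pseudoinverse and once with a rank-1 truncation. It assumes the standard textbook form $\mathrm{vol}(G)\,(e_u - e_w)^T L^{+} (e_u - e_w)$, which may differ in scaling or notation from Equation (4.2). The helper names are hypothetical.

```python
import numpy as np

def laplacian(adj):
    """Combinatorial graph Laplacian D - W of a symmetric adjacency matrix."""
    return np.diag(adj.sum(axis=1)) - adj

def commute_time(L_pinv, u, w, volume):
    """Assumed form: vol(G) * (e_u - e_w)^T L^+ (e_u - e_w)."""
    e = np.zeros(L_pinv.shape[0])
    e[u], e[w] = 1.0, -1.0
    return volume * (e @ L_pinv @ e)

def truncated_pinv(L, k):
    """Rank-k approximation of L^+ built from its k leading eigenpairs."""
    vals, vecs = np.linalg.eigh(L)            # ascending eigenvalues of L
    keep = vals > 1e-10 * vals.max()          # drop the null space of L
    vals, vecs = vals[keep][:k], vecs[:, keep][:, :k]
    return (vecs / vals) @ vecs.T             # smallest nonzero of L = largest of L^+

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph 0-1-2
L = laplacian(adj)
volume = adj.sum()                            # sum of vertex degrees
exact = commute_time(np.linalg.pinv(L), 0, 1, volume)
approx = commute_time(truncated_pinv(L, 1), 0, 1, volume)
print(exact, approx)
```

On this toy graph the rank-1 truncation underestimates the distance between vertices 0 and 1, which is the sense in which approximation error in $L^{+}$ carries over to the distance.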

Figure 4.15: Maximum loss of orthogonality in projection basis for the Colorado road network Laplacian,
plotted against subspace dimension; random projections versus classic Lanczos (left) and random projections
versus the hybrid random projections-Krylov subspace method using refined block Lanczos (right).

Figure 4.16: Grand tour of first 20 dimensions of commute-time embedding of low-rank approximation of the
Colorado Road network graph in $\mathcal{K}_4(A, X_0)$.

For  other  applications  of  the  commute-time  embedding,  the  Frobenius  or  nuclear  norm  approxi¬ 
mation  error  does  not  directly  imply  quality.  We  consider  one  such  application  of  the  commute-time 
embedding:  visualization  of  a  graph.  For  visualization,  convergence  of  the  eigenvectors  is  just  as  im¬ 
portant  as  the  approximation  error.  This  distinction  between  approximation  error  and  convergence 
of  eigenvectors  is  important,  as  the  low-rank  approximation  methods  under  consideration  here  may 
have  substantially  differing  degrees  of  convergence  of  eigenvectors,  even  when  low-rank  approxima¬ 
tions  have  equivalently  low  error  in  the  Frobenius  or  nuclear  norm. 

To illustrate the difference in low-rank approximations which require eigenvector approximations,
we show the grand tour of the first 20 dimensions of the commute-time embedding of the Colorado
road network graph as approximated by random projections, and by the Krylov subspace
approximation in $\mathcal{K}_4(A, X_0)$. Figure 4.16 shows the commute-time embedding generated with 4 iterations of
the block Lanczos method with a block size of 10, and Figure 4.17 shows the commute-time embedding as
generated with the random projection method in Algorithm 3. These may be compared against the
low-rank approximation produced with the truncated spectral decomposition in Figure 4.18. It is
evident from the plots in Figures 4.16 and 4.17 that the Krylov subspace approximation produces a
better visual approximation to the layout produced by the truncated spectral decomposition.




4.  Applications  of  short  block  Krylov  subspaces 



Figure  4.17:  Grand  tour  of  first  20  dimensions  of  commute-time  embedding  of  low-rank  approximation  of  the 
Colorado  Road  network  graph  generated  with  random  projections. 



Figure  4.18:  Grand  tour  of  first  20  dimensions  of  commute-time  embedding  of  low-rank  approximation  of  the 
Colorado  Road  network  graph  using  the  truncated  spectral  decomposition. 

Figure 4.19: Leading 100 eigenvalues (left) and leading 20 gaps (right) of $A^{-1}$ for the crankshaft stiffness
matrix problem.


4.4  Stiffness  matrix  experiments 

Thus far, we have considered matrices for which sparse matrix-vector products are inexpensive
relative to the dense linear algebra operations in the random projections or Lanczos algorithms.
In many cases, the sparse matrix-vector products are the principal expense in either the Lanczos or
random projection method. When these costs are principal, a new set of tradeoffs is introduced,
but the shrink-and-iterate and hybrid approaches still have advantages. These advantages are a
direct result of the smaller number of sparse matrix-vector products necessary to produce a low-rank
approximation. In fact, the shrink-and-iterate method with three block Lanczos iterations requires
only 75% of the sparse matrix-vector products required by the random projection method.

To study a case for which sparse matrix-vector products are not inexpensive compared to the
dense linear algebra operations, we use the bmwcra_1 matrix from the University of Florida's sparse
matrix collection [18], which is a finite-element stiffness matrix representing an automotive
crankshaft. The matrix is 148,770 × 148,770, square and positive definite, with 10,644,002 nonzero entries.
The matrix does admit a sparse Cholesky factorization, but with considerable fill-in; the resulting
Cholesky factors have 10,644,002 nonzero elements and consume nearly 4 GB of memory. The
leading 100 eigenvalues of the inverted stiffness matrix and leading 20 eigenvalue gaps are shown in
Figure 4.19.

Solution of the resulting triangular system is therefore not much less expensive than the
matrix-matrix products and QR factorizations used for low-rank approximation. We would expect this to
diminish or eliminate the time advantage of the shrink-and-iterate approach. When the cost of the
sparse matrix-vector products necessary to form $AQ_i$ is considered, the speedup gained by
a reduction in block size is limited by the proportion of costs of sparse matrix operations to dense
matrix operations, as per Amdahl's argument. The cost of each Lanczos iteration is dominated by
the sparse matrix-vector products to form $AQ_i$ rather than the operations to form $A_i$ or $Q_{i+1}B_{i+1}$;
little speedup can be expected from simply shrinking the block size.

More expensive sparse matrix-vector products compared to dense linear algebra operations
change the compute cost relationship between random projections and block Lanczos iteration;
however, this does not place the shrink-and-iterate block Lanczos approach at a significant
disadvantage. The compute advantage of block Lanczos in these cases is that it generates the projection
$T$ of $A$ in the working subspace $\mathcal{K}_i(A, X_0)$ as iteration proceeds, as opposed to the random
projection method, which must first generate a basis for $\mathrm{span}\{A\Omega\}$ before generating the projection of $A$ in
$\mathrm{span}\{A\Omega\}$. Indeed, the random projection method requires just as many sparse matrix-vector
products as the shrink-and-iterate approach with 4 iterations, and the shrink-and-iterate approach with
3 iterations requires 75% of the sparse matrix-vector products. The hybrid method may not be
advantageous from a compute-time perspective, as the principal driver of cost is sparse matrix-vector
products.
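
The product counts quoted above can be checked with a small tally. This sketch assumes a target dimension $k$, a random projection method using block size $k$ with $p = 1$ power step plus one projection step, and block Lanczos using block size $k/2$ with one block product against $A$ per iteration. The function names are illustrative and not taken from the text.

```python
def random_projection_matvecs(k, p=1):
    """Block size k: p block products to form the basis, plus one to project A."""
    return (p + 1) * k

def block_lanczos_matvecs(k, iterations, shrink=2):
    """Block size k / shrink: one block product with A per Lanczos iteration."""
    return iterations * (k // shrink)

k = 64
rp = random_projection_matvecs(k)
bl3 = block_lanczos_matvecs(k, iterations=3)
bl4 = block_lanczos_matvecs(k, iterations=4)
print(rp, bl3, bl4, bl3 / rp)
```

Under these assumptions four iterations match the random projection count and three iterations need three quarters of it.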

The  experiments  performed  with  the  crankshaft  stiffness  matrix  coincide  with  the  expectations 
developed  with  consideration  of  the  sparse  matrix-vector  products  performed  in  the  random  projec¬ 
tions  and  block  Lanczos  methods.  The  more  expensive  sparse  matrix-vector  products  do  erode  the 
compute  advantages  of  initial  power  iteration  in  the  hybrid  approach,  so  2  block  Lanczos  iterations 
with  2  initial  power  iterations  are  not  much  better  than  4  block  Lanczos  iterations  from  a  time 
perspective,  though  they  are  better  from  a  loss-of-orthogonality  perspective.  Figure  4.20  shows  the 
FLOP  counts  for  generating  low-rank  approximations  of  the  inverted  crankshaft  matrix  for  various 
dimensions,  while  Figure  4.21  shows  the  FLOP  ratios  between  the  shrink-and-iterate  and  random 
projection  methods,  and  hybrid  and  random  projection  methods. 

The  eigenvalue  distribution  of  this  problem  has  large  gaps  at  the  extreme  of  the  spectrum;  these 




Figure 4.20: FLOP counts for producing low-rank approximations of the inverted crankshaft stiffness matrix.
From left to right the plots show FLOPS for $\mathcal{K}_3(A, X_0)$, $\mathcal{K}_4(A, X_0)$, $\mathcal{K}_2(A, A^p X_0)$, and random projections.
FLOPS for sparse matrix operations are considered separately from FLOPS for dense linear algebra
operations.
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Figure  4.21:  Ratios  of  sparse  (left),  dense  (middle)  and  total  (right)  FLOPS  for  producing  a  low-rank  approxi¬ 
mation  to  the  inverted  crankshaft  stiffness  matrix. 

gaps  are  proportionally  even  larger  than  for  the  Yale  face  database  matrix.  Like  the  Yale  face  data, 
this  distribution  places  a  large  fraction  of  the  norm  in  the  leading  two  eigenvalues;  therefore,  a 
method  may  generate  a  good  low-rank  approximation  simply  by  approximating  them  well.  Fig¬ 
ure  4.22  shows  the  approximation  errors  for  low-rank  approximations  produced  by  random  projec¬ 
tions,  the  shrink-and-iterate  method  and  the  hybrid  method. 

4.5  Conclusion 

Low-rank  matrix  approximation  may  be  accomplished  with  a  truncated  SVD  or  spectral  decomposi¬ 
tion.  However,  computation  of  either  decomposition  is  likely  exorbitantly  expensive  for  direct  meth¬ 
ods,  or  even  for  iterative  methods  when  the  number  of  spectral  components  is  large  enough.  In  these 
cases,  approximations  of  eigenvectors  and  eigenvalues  may  produce  a  low-rank  approximation  with 
norm  rivaling  that  of  the  truncated  spectral  decomposition,  but  at  far  reduced  cost.  One  of  several 
methods  may  be  used  to  approximate  the  truncated  spectral  decomposition. 

Recently  there  has  been  interest  generated  in  power  iteration-like  random  projection  methods  for 

Figure 4.22: Low-rank approximation errors of inverted crankshaft stiffness matrix; random projections
versus classic Lanczos (top left) and random projections versus the hybrid method with $p = 2$ (top right). Squared
Frobenius norm of random projections and block Lanczos iteration compared against single-vector Lanczos
(bottom).

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Hermitian  eigenproblems  and  compelling  results  have  been  reported.  These  methods  have  numerous 
favorable  attributes;  they  have  good  eigenvalue  approximation  properties,  they  have  moderate  com¬ 
putational  requirements,  they  lead  to  simple  computational  implementations,  and  they  are  immune 
to  the  loss  of  orthogonality  that  afflicts  Krylov  subspace  methods.  These  approximated  eigenvalues 
and  eigenvectors  may  be  used  to  generate  low-rank  matrix  approximations  at  substantially  reduced 
cost  compared  to  a  truncated  SVD  or  spectral  decomposition.  Krylov  subspace  methods  are  closely 
related  to  power  iteration,  and  the  success  of  power  iteration  combined  with  random  projections 
suggests  hybrid  approaches  that  combine  the  best  elements  of  each  approach. 

Though random projections have a variety of favorable attributes, one disadvantage is
their computational complexity. The computational complexity of random projection methods with
power iteration scales quadratically with the dimension computed. If one wishes to double the
dimension of a low-rank matrix approximation, the computational cost of using random projection
methods increases four-fold when dense linear algebra operations dominate. When sparse
matrix-vector products are inexpensive, meaningful gains may be realized by block size shrinkage.
Alternatively, when sparse matrix-vector products are expensive, the number of sparse
matrix-vector products used in random projections becomes a disadvantage compared to Krylov subspace
methods. Random projection methods require $p + 1$ block vector multiplications against $A$: $p$ to
form the subspace basis and one to project $A$ into that basis. The block Lanczos method produces a
projection of $A$ in $\mathcal{K}_i(A, X_0)$ as it iterates; therefore, 3 block Lanczos iterations or one power iteration
plus two block Lanczos iterations will perform fewer sparse matrix-vector products than the random
projection method. Use of a block Krylov subspace rather than a random projection space can lead
to compute time savings whether sparse matrix-vector products are inexpensive or expensive.
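
The $p + 1$ block products can be seen in a generic randomized subspace iteration. The sketch below is a common textbook form, not a transcription of Algorithm 3: it orthonormalizes after every product, and the name `random_projection_eig` and the diagonal test matrix are assumptions made for illustration.

```python
import numpy as np

def random_projection_eig(A, k, p=1, seed=0):
    """Approximate k leading eigenpairs of symmetric A from span{A^p Omega}."""
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((A.shape[0], k))   # Gaussian start block Omega
    products = 0
    for _ in range(p):                         # p products to form the basis
        Q, _ = np.linalg.qr(A @ Q)
        products += 1
    T = Q.T @ (A @ Q)                          # one more product to project A
    products += 1
    theta, S = np.linalg.eigh((T + T.T) / 2)   # Rayleigh-Ritz step
    order = np.argsort(theta)[::-1]
    return theta[order], Q @ S[:, order], products

# Hypothetical test spectrum: three separated eigenvalues above a flat tail.
lam = np.concatenate([[100.0, 50.0, 25.0], np.linspace(1.0, 0.1, 97)])
A = np.diag(lam)
theta, vectors, products = random_projection_eig(A, k=6, p=2)
print(products, theta[:3])
```

Each product here acts on a block of $k$ columns, so doubling $k$ doubles the sparse work and roughly quadruples the dense QR and projection work.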

The random projection methods also suggest new ways to cope with the loss of orthogonality
experienced in Krylov subspace routines. Low-rank matrix approximation applications of Krylov
subspaces do not necessarily require convergence of eigenvalues. Loss of orthogonality is directly
related to the length of the Krylov sequence, so when convergence of eigenvalues is not important,
one may reduce the loss of orthogonality by increasing the block size and reducing the number of
iterations. The number of iterations may also be reduced with a hybrid approach that initializes the
Krylov subspace with a start block generated with power iteration. This reduces loss of orthogonality
by shortening the length of the Krylov sequence. Further stability may be realized by adding a
refinement step to the block Lanczos process. Use of both initial power iteration to minimize the length
of the Krylov sequence and a refinement step to reduce round-off error results in the block Lanczos
process being as stable as random projections for low-rank matrix approximation. This added
stability eliminates the need for restarts or selective reorthogonalization to stabilize the block
Lanczos method, and thereby renders it far less complicated than current practical implementations.
As the random projection method uses a harmonic Ritz basis, an interesting future effort would
be to compare the performance of the hybrid method against explicit restarts using harmonic Ritz
vectors [41] from short Lanczos recurrences.

Several factors may influence compute times, but we focus on a theoretical FLOP count analysis,
with a distinction between sparse and dense floating point operations. With respect to dense floating
point operations, both the shrink-and-iterate and hybrid approaches will never require more FLOPS
than random projections; for 3 iterations, they will require roughly 75% of the dense FLOPS required
for random projections. The case is identical for sparse FLOPS, as both the hybrid and
shrink-and-iterate methods can construct a low-rank approximation of $A$ restricted to the Krylov subspace and a basis
for the subspace as iteration proceeds. By way of comparison, one feature that hampers the performance of
random projection methods is that they must perform at least $p + 1$ sparse matrix-vector products
to generate an approximation restricted to $\mathrm{span}\{A^p \Omega\}$, one to generate an orthonormal basis for the
subspace, and another to project the input matrix into the subspace. This feature is a limitation
when sparse matrix-vector products are expensive, as is the case for the inverted crankshaft matrix.
Yet in these cases, the block Krylov methods developed here are also at somewhat of a disadvantage,
as they must perform at least $1.5k$ sparse matrix-vector products. If the cost of a sparse
matrix-vector product is sufficiently large, there will be no difference between the shrink-and-iterate
method with 4 iterations and random projections. Likewise, there will not be a substantial difference
between the hybrid method operating in $\mathcal{K}_2(A, A^p X_0)$ and the random projection method. Therefore,
the biggest advantage for shrink-and-iterate or hybrid methods arises when the input matrix has
well-separated leading eigenvalues and is sufficiently sparse as to allow for fast sparse matrix-vector
products. In these cases, the hybrid approaches allow for both better convergence and possibly faster
compute times.
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Chapter  5 


GrABL:  A  Greedy  Adaptive  Block  Lanczos 
method 


In the preceding chapters, we have demonstrated short block Krylov subspace methods for
low-rank approximation problems; both the “shrink-and-iterate” and hybrid Krylov subspace-random
projection methods offer potentially better approximations and smaller computational costs than
random projection methods, coupled with reduced loss of orthogonality. The computational time
advantages these methods hold over random projections are a direct result of smaller block size; as we
mentioned in Chapter 3, shrinking the block size by a factor of $s$ leads to a speedup per iteration.
The “shrink-and-iterate” and hybrid short block Krylov subspace approaches use a fixed shrinkage
factor of $s = 2$, and perform 3 or 4 iterations. They therefore have a fixed cost advantage over random
projections.

A block Lanczos process that adaptively grows and shrinks block size can leverage large blocks
to drive convergence of eigenvalues through exploitation of local gaps, and then deflate the block size
to reduce the cost per iteration. An adaptive block method could be better able to manage
computational costs than the short block Krylov subspace methods developed in the previous chapters, and
obtain an asymptotic advantage over random projections. Additionally, use of a smaller initial block
for stationary power iteration, as is used in the hybrid block Krylov-random projections method in
Algorithm 5, will result in reduced computational costs due to sparse matrix-vector products. This
advantage will be marked in the cases for which sparse matrix-vector products are the dominant compute
costs, such as the stiffness matrix experiments presented in Section 4.4. Nevertheless, larger degrees
of block size shrinkage will necessitate more iterations, which compromises the loss of orthogonality
benefits developed for the short block Krylov subspace methods in the previous chapter; this may
be addressed with one of the standard partial or selective orthogonalization strategies [57, 74].

Adaptively  blocked  Krylov  subspace  methods  have  been  developed  previously  [5,86].  Rather  than 
using  a  fixed  block  size,  the  algorithm  detects  clustering  of  Ritz  values  as  it  iterates,  and  inflates  the 
block  size  if  it  determines  that  the  number  of  clustered  Ritz  values  exceeds  the  current  block  size.  No¬ 
tably,  the  existing  adaptive  block  Krylov  subspace  methods  are  intended  for  accurate  computation  of 
eigenvalues.  Inflation  of  the  block  size  is  performed  to  allow  for  accurate  computation  of  eigenvalues 
that  are  tightly  clustered.  These  algorithms  do  not  inflate  to  accelerate  minimization  of  aggregate 
singular  value  error,  but  are  intended  to  run  until  convergence  of  a  set  of  eigenpairs  of  interest.  For 
the  relaxed  generic  reduction  problem,  it  is  tolerable  if  eigenvalues  and  eigenvectors  have  not  con¬ 
verged  completely,  but  use  of  a  minimal  Krylov  subspace  introduces  a  different  set  of  trade-offs  not 
present  in  the  eigenvalue  problem.  Most  importantly,  the  block  size  restricts  the  number  of  itera¬ 
tions  possible,  which  in  turn  restricts  the  convergence  of  eigenvalues  in  the  Krylov  subspace.  This 
restriction  changes  the  approach  to  inflation  and  deflation,  and  dictates  when  and  how  they  should 
occur. 

We  develop  GrABL,  an  adaptive-block  size  Lanczos  algorithm  specifically  for  low-rank  approxi¬ 
mation  problems.  GrABL  differs  from  the  existing  adaptive  block  Lanczos  methods  via  its  inflation 
method  and  deflation  criterion.  Its  inflation  method  is  informed  by  the  changes  in  eigenvalue  con¬ 
vergence  rates  due  to  increases  in  block  size,  and  assumes  eigenvalue  distributions  typical  of  generic 
low-rank  approximation  problems  and  the  random  projection  method.  Rather  than  increasing  block 
sizes  when  clustered  eigenvalues  are  encountered,  it  increases  block  size  only  when  the  added  block 
columns  increase  convergence  of  the  well-separated  leading  eigenvalues.  Rather  than  deflate  when 
linear  independence  of  any  Lanczos  block  is  lost,  it  deflates  immediately  after  detection  of  an  optimal 
block  size  to  obtain  the  best-possible  compute  advantage  for  successive  iterations. 

Figure 5.1: Eigenvalues of $AA^T$ for $A$ as the extended Yale eigenface database, plotted against eigenvalue
index. Images were flattened to vectors.

5.1  Typical spectra of generic dimension reduction problems

Convergence  of  eigenvalues  in  Krylov  subspaces  depends  on  the  distribution  of  the  spectrum;  the 
typical  spectrum  of  generic  dimension  reduction  problems  happens  to  have  properties  that  enable 
fast  convergence  of  exactly  those  eigenvalues  and  eigenvectors  necessary  for  a  good  dimension  re¬ 
duction.  Typically,  in  a  generic  dimension  reduction  problem  such  as  PCA  or  LSI,  it  is  assumed  that 
there  exists  a  latent,  low-dimensional  problem  that  is  embedded  in  a  high-dimensional  space.  The 
high-dimensional  space  also  has  noise  added,  albeit  at  a  far  lower  magnitude  than  the  data.  The 
leading  eigenvectors  are  assumed  to  be  a  basis  for  the  latent  space,  and  the  remaining  dimensions 
are  assumed  to  be  noise.  The  magnitudes  of  the  latent  space  vectors  and  the  noise-space  vectors  will 
reflect the signal-to-noise ratio of the problem. Convergence of eigenvalues in a Krylov subspace is
driven by, among other things, eigenvalue gaps. Therefore, the latent space vectors will converge
quickly in a Krylov subspace. In the vast majority of data reduction applications, there is not an
abrupt step between the latent space eigenvalues and the noise eigenvalues. Rather, eigenvalue
gaps decrease rapidly for the first few gaps, and then decrease more slowly after that. Figure 5.1
shows the spectrum of an eigenface data set, and is characteristic of generic dimension reduction
problems. Spectra such as that shown in Figure 5.1 are essential to the remainder of this chapter. It is
assumed that only a few leading eigenvalues are well-separated from the rest, with tight clustering
encountered early in the spectrum. These properties give good convergence to block methods,
but also render them computational overkill when there are fewer well-separated eigenvalues than
columns in the start block.
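
A synthetic example may help picture the kind of spectrum described here. The sketch below embeds a hypothetical 5-dimensional latent signal in 200 dimensions and adds low-magnitude noise; all sizes and scales are assumptions chosen for illustration. Because the model is idealized, it shows a sharper step between signal and noise than the real data in Figure 5.1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 200, 500, 5                      # ambient dim, samples, latent dim
scales = np.array([50.0, 25.0, 12.0, 6.0, 3.0])

basis, _ = np.linalg.qr(rng.standard_normal((n, d)))    # latent directions
latent = rng.standard_normal((d, m)) * scales[:, None]  # latent coordinates
noise = 0.1 * rng.standard_normal((n, m))               # low-magnitude noise
A = basis @ latent + noise

eigvals = np.linalg.eigvalsh(A @ A.T)[::-1]             # largest first
gaps = eigvals[:-1] - eigvals[1:]
print(eigvals[:7])
print(gaps[:7])
```

Only the first few gaps are large; a start block with many more than five columns would spend most of its columns on the clustered noise eigenvalues.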

5.2  Classic  block  Lanczos 

If  a  priori  knowledge  of  the  spectrum  is  available  to  determine  an  optimal  block  size,  one  may  simply 
apply  the  classic  block  Lanczos  algorithm  to  produce  a  minimal  Krylov  subspace.  The  classic  block 
Lanczos  algorithm  is  presented  in  Algorithm  2.  We  remark  that  the  classic  block  Lanczos  algorithm 
is  elegant  in  its  most  abstract  instantiation.  In  exact  arithmetic,  the  Lanczos  algorithm  guarantees 
orthogonality  between  the  Lanczos  blocks  Qi,  but  practical  implementations  include  some  measures 
to  recover  orthogonality  lost  from  finite-precision  operations.  Provided  that  the  matrix  A  is  sparse, 
the costs of each Lanczos iteration are dominated by the dense matrix-matrix operations at lines 4, 5
and 7, which require $O(nk^2)$ operations.
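
For orientation, a block Lanczos recurrence can be sketched as follows. This is a minimal illustration rather than a transcription of Algorithm 2: the line numbers cited above do not apply to it, and it adds full reorthogonalization against all earlier blocks as a simple safeguard, which is an assumption of this sketch.

```python
import numpy as np

def block_lanczos(A, X0, iterations):
    """Return an orthonormal basis Q and the block tridiagonal T = Q^T A Q."""
    Q0, _ = np.linalg.qr(X0)
    blocks, b = [Q0], Q0.shape[1]
    T = np.zeros((iterations * b, iterations * b))
    B_prev = None
    for j in range(iterations):
        Qj = blocks[j]
        R = A @ Qj                                  # sparse product in practice
        if j > 0:
            R = R - blocks[j - 1] @ B_prev.T        # three-term recurrence
        Aj = Qj.T @ R                               # dense, O(n b^2)
        R = R - Qj @ Aj                             # dense, O(n b^2)
        basis = np.hstack(blocks)
        R = R - basis @ (basis.T @ R)               # safeguard: full reorthogonalization
        rows = slice(j * b, (j + 1) * b)
        T[rows, rows] = (Aj + Aj.T) / 2
        if j + 1 < iterations:
            Q_next, B = np.linalg.qr(R)             # dense, O(n b^2)
            nxt = slice((j + 1) * b, (j + 2) * b)
            T[nxt, rows], T[rows, nxt] = B, B.T
            blocks.append(Q_next)
            B_prev = B
    return np.hstack(blocks), T

rng = np.random.default_rng(0)
M = rng.standard_normal((80, 80))
A = M + M.T                                         # symmetric test matrix
Q, T = block_lanczos(A, rng.standard_normal((80, 4)), iterations=5)
print(np.max(np.abs(Q.T @ A @ Q - T)))
```

The dense steps all scale with the square of the block size, which is why shrinking the block reduces the cost per iteration when the product with $A$ is cheap.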

5.3  ABLE: adaptively blocked Lanczos for the eigenproblem

It  is  rare  that  sufficient  knowledge  of  the  spectrum  may  be  known  before  the  fact  or  is  easily  deducible 
from  the  matrix.  This  motivates  development  of  block  Lanczos  methods  that  can  adapt  block  size  as 
iteration  progresses  without  discarding  any  previously  computed  information  about  the  spectrum 
of  A.  ABLE  [4,  5]  is  one  such  algorithm  for  the  non-Hermitian  eigenproblem.  ABLE  uses  block 
size  inflation  to  recover  from  breakdown  and  extract  clustered  eigenvalues.  The  ABLE  algorithm 
simplified  for  the  Hermitian  case  is  presented  in  Algorithm  6.  It  is  notable  that  the  problems  of 
breakdown  and  tightly  clustered  eigenvalues  are  not  as  severe  in  the  Hermitian  case.  ABLE  is 
intended  to  solve  non-Hermitian  eigenproblems  and  cope  with  the  particular  problems  they  present. 
However, if one were simply to use ABLE for a low-rank approximation, it is not clear what values
should be used for $\eta$ and $\theta_k$. To alleviate the requirement for knowledge of the spectrum of $A$ to
adapt block size, we note that one may monitor convergence of Ritz values in the Krylov subspace to
infer the size of the local gaps. The cost of ABLE depends on the block size history, and is
$\sum_{j=1}^{r} (4 b_j^2 n + 2 b_j n_{nnz})$, where ABLE takes $r$ iterations and $b_j$ is the block size at iteration $j$. Of the elements
of this expression, $2 k n_{nnz}$ FLOPS are due to sparse operations, with the complement due to dense
operations.
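
The dependence of cost on block size history can be tabulated directly. This sketch assumes the cost expression reads as a sum over iterations of $4 b_j^2 n + 2 b_j n_{nnz}$, which is one reading of the formula above; the function name and the sample sizes are hypothetical.

```python
def able_flops(block_sizes, n, nnz):
    """Split the assumed ABLE cost model into dense and sparse FLOPS."""
    dense = sum(4 * b * b * n for b in block_sizes)   # 4 b_j^2 n per iteration
    sparse = sum(2 * b * nnz for b in block_sizes)    # 2 b_j nnz per iteration
    return dense, sparse

history = [4, 4, 8]            # block size at each iteration
n, nnz = 1000, 5000
dense, sparse = able_flops(history, n, nnz)
k = sum(history)               # total subspace dimension
print(dense, sparse, 2 * k * nnz)
```

Under this model the sparse total depends only on the final dimension $k$, while the dense total depends on how the dimension was split into blocks.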

Algorithm 6  ABLE: Adaptive block Lanczos for the eigenproblem
Require: a priori chosen start vector $Q_1$
 1: $R \leftarrow A Q_1$
 2: for $j = 1 \to k$ do
 3:     $A_j \leftarrow Q_j^T R$
 4:     $R \leftarrow R - Q_j A_j$
 5:     QR factorize $R = Q_{j+1} B_{j+1}$
 6:     spectral decompose $T_j$ [factors illegible in source]
 7:     detect clustering in $\Theta$: if $|\{\theta_j : |\theta_j - \theta_k| < \eta\}|$ for a given $\theta_k$ and tolerance $\eta$ is larger than the
        block size, inflate $Q_j$ with $Q \perp Q_i$ for all $i < j$
 8:     [illegible in source]
 9:     $R$ [remainder illegible in source]
10: end for
11: return $Q, T$


5.3.1  Convergence  of  eigenvalues  in  block  Krylov  subspaces 

Convergence of eigenvalue approximations for the projection of $A$ into $\mathcal{K}_i(A, X_0)$ is driven both by
block size $r$ and Krylov sequence length $i$; therefore, convergence may be achieved either by
increasing $i$ or $r$. Computational costs are also governed by $i$ and $r$; the cost to generate a minimal Krylov
subspace of dimension $k$ is asymptotic with $k r^2 / r = kr$ (that is, $k/r$ iterations at a cost proportional
to $r^2$ each), disregarding the costs of stabilization and the dimension of $A$. As $k$ is fixed, doubling
the block size will double the compute cost to generate the Krylov subspace. The random projection
method may have good results, but is also the most expensive method. Naturally, we would like to
have eigenvalues converge as quickly as possible given a fixed subspace dimension, subject to minimal
computational costs. This introduces a new set of trade-offs between single-vector and block Krylov
subspaces. Eigenvalues may converge faster in a block Krylov subspace if spectral properties are
favorable, but the larger block size introduces more computational costs compared to a smaller block
size coupled with a longer Krylov sequence. Furthermore, the minimal Krylov subspace restriction
limits the number of iterations available to a block method more than it limits the number of
iterations available to a single-vector method. Fewer iterations restrict the convergence of
eigenvalues in the subspace, which requires that the improvement in convergence due to local gaps
outweigh the loss in convergence due to restrictions on the number of iterations.

The use of a minimal or near-minimal Krylov subspace is a central feature of this effort; limits
on convergence determine those spectra for which a block Krylov subspace is advantageous versus a
single-vector Krylov subspace. The cases for which a block method is most advantageous are somewhat
counter-intuitive for minimal Krylov subspaces. Rather than being most appropriate when
eigenvalues are tightly clustered, block methods are advantageous when the eigenvalues are
well-separated. If the leading part of the spectrum of $A$ is sufficiently flat up to desired minimal
subspace size $k$, then block methods cannot achieve tight error bounds without iteration. Conversely,
if the leading eigenvalues are all mutually well-separated, then good error bounds are possible
through block size alone. Before discussing the convergence of eigenvalues in Krylov subspaces, we
recall the error bounds for block Krylov subspace projections due to Underwood [81] in Theorem 3.

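We remark that the Underwood theorem may be stated with the substitution of the Chebyshev term with

\[ T_{i-1}^2(\hat\gamma_j) \tag{5.1} \]

with

\[ \hat\gamma_j = 1 + 2\,\frac{\lambda_j - \lambda_{r+1}}{\lambda_{r+1} - \lambda_{\inf}}, \tag{5.2} \]

though these two formulations are equivalent.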

These bounds show what properties of the spectrum of $A$ and the Krylov subspace shrink errors
of eigenvalue approximations. Making the denominator in (2.3) large will drive the approximation
error for $\lambda_j$ to zero; this is accomplished by either increasing the order of the Chebyshev
polynomial or by increasing $\hat\gamma_j$. These increases are accomplished by increasing the Krylov sequence
length or increasing the block size, respectively. When the leading $s$ eigenvalues are all mutually
well-separated, then dramatic improvements may be realized by setting the block size $r = s$. For
leading eigenvalues, rapid convergence will also result from lengthening the Krylov sequence, as the
Chebyshev polynomial grows rapidly with increasing order $i - 1$.
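
The two ways of enlarging the Chebyshev term can be illustrated numerically. The following is a minimal Python sketch, assuming NumPy and a made-up spectrum chosen only for illustration; the helper names `chebyshev_T` and `gamma_hat` are hypothetical and do not come from the text. It evaluates $T_{i-1}^2(\hat\gamma_1)$ for a few sequence lengths $i$ and two block sizes $r$, using $\hat\gamma_j = 1 + 2(\lambda_j - \lambda_{r+1})/(\lambda_{r+1} - \lambda_{\inf})$.

```python
import numpy as np

def chebyshev_T(order, x):
    """Chebyshev polynomial of the first kind, T_order(x), valid for x >= 1."""
    return np.cosh(order * np.arccosh(x))

def gamma_hat(lams, j, r):
    """1 + 2 (lam_j - lam_{r+1}) / (lam_{r+1} - lam_inf), with 1-based j and r.

    lams must be sorted in decreasing order; lam_inf is taken to be lams[-1].
    """
    lam_inf = lams[-1]
    return 1.0 + 2.0 * (lams[j - 1] - lams[r]) / (lams[r] - lam_inf)

# Hypothetical spectrum, used only to illustrate the trend.
lams = np.array([10.0, 8.0, 6.0, 4.0, 2.0, 1.0])

for r in (1, 3):
    g = gamma_hat(lams, 1, r)
    # Squared Chebyshev term for Krylov sequence lengths i = 1, ..., 4.
    terms = [chebyshev_T(i - 1, g) ** 2 for i in (1, 2, 3, 4)]
    print(f"r={r}: gamma_hat_1={g:.3f}, T^2 terms:", np.round(terms, 1))
```

In this toy spectrum, both a longer sequence and a larger block size increase the term, which is the behaviour the bound relies on.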


5.3.2  Convergence  via  block  size  versus  convergence  via  iterations 

The random projection algorithm in Algorithm 3 uses the maximal block size, and performance is
good when eigenvalues are well-separated. However, when the leading $r$ eigenvalues are not
well-separated, convergence in a short Krylov subspace may be poor. Clearly, if the leading spectrum of
$A$ is such that $\lambda_1 = \lambda_i$ for $1 \le i \le r$, then all $\hat\gamma_i$ will be equal for all block sizes less than $r$. We would
like to show that if the leading part of the spectrum is “sufficiently flat” (that is, all eigenvalues
are tightly clustered and have small local gaps) then the improvement in convergence due to block
size is comparatively small. To that end, we prove the following theorem.

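Theorem 10. Suppose $\lambda_1 > \lambda_2 > \dots > \lambda_{\inf}$. Use $\hat\gamma_{i,r}$ for $1 + 2\,\frac{\lambda_i - \lambda_{i+r}}{\lambda_{i+r} - \lambda_{\inf}}$. Then the improvement in $\hat\gamma_i$ as
defined in (2.3) is given by

\[ \frac{\hat\gamma_{i,r}}{\hat\gamma_{i,r+1}} = \frac{(\lambda_{\inf} - \lambda_{i+r+1})(2\lambda_i - \lambda_{i+r} - \lambda_{\inf})}{(\lambda_{\inf} - \lambda_{i+r})(2\lambda_i - \lambda_{i+r+1} - \lambda_{\inf})}. \tag{5.3} \]

Proof. We have

\[ \hat\gamma_{i,r} = 1 + 2\,\frac{\lambda_i - \lambda_{i+r}}{\lambda_{i+r} - \lambda_{\inf}} \tag{5.4} \]

by its definition in (2.3). Simplifying the ratio $\hat\gamma_{i,r}/\hat\gamma_{i,r+1}$ immediately yields

\[ \frac{\hat\gamma_{i,r}}{\hat\gamma_{i,r+1}} = \frac{(\lambda_{\inf} - \lambda_{i+r+1})(2\lambda_i - \lambda_{i+r} - \lambda_{\inf})}{(\lambda_{\inf} - \lambda_{i+r})(2\lambda_i - \lambda_{i+r+1} - \lambda_{\inf})}, \tag{5.5} \]

which completes the proof. □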
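
The ratio in Theorem 10 can be checked numerically. The sketch below is illustrative only: it assumes NumPy, uses two made-up spectra (one well-separated, one with a flat leading part), and the function names `gamma_hat_ir` and `ratio_closed_form` are hypothetical. It compares the ratio $\hat\gamma_{i,r}/\hat\gamma_{i,r+1}$ computed directly from $\hat\gamma_{i,r} = 1 + 2(\lambda_i - \lambda_{i+r})/(\lambda_{i+r} - \lambda_{\inf})$ with the closed form of the theorem.

```python
import numpy as np

def gamma_hat_ir(lams, i, r):
    """1 + 2 (lam_i - lam_{i+r}) / (lam_{i+r} - lam_inf), with 1-based i."""
    lam_inf = lams[-1]
    lam_ir = lams[i + r - 1]
    return 1.0 + 2.0 * (lams[i - 1] - lam_ir) / (lam_ir - lam_inf)

def ratio_closed_form(lams, i, r):
    """Closed form of gamma_hat_{i,r} / gamma_hat_{i,r+1} from Theorem 10."""
    lam_inf = lams[-1]
    li, a, b = lams[i - 1], lams[i + r - 1], lams[i + r]
    num = (lam_inf - b) * (2.0 * li - a - lam_inf)
    den = (lam_inf - a) * (2.0 * li - b - lam_inf)
    return num / den

# Hypothetical spectra, sorted in decreasing order.
separated = np.array([100.0, 50.0, 25.0, 12.0, 6.0, 3.0, 1.0, 0.0])
flat = np.array([100.0, 99.9, 99.8, 99.7, 99.6, 99.5, 1.0, 0.0])

for name, lams in (("separated", separated), ("flat", flat)):
    direct = gamma_hat_ir(lams, 1, 1) / gamma_hat_ir(lams, 1, 2)
    closed = ratio_closed_form(lams, 1, 1)
    print(f"{name}: direct={direct:.6f}, closed form={closed:.6f}")
```

For the flat spectrum the ratio is close to 1, so growing the block from $r$ to $r + 1$ barely changes $\hat\gamma$; for the separated spectrum the ratio is well below 1.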

An implication of this theorem is that tightly clustered eigenvalues cause minimal returns for
eigenvalue convergence due to block size inflation. Such eigenvalue clustering is common in interior
eigenvalues in a variety of low-rank approximation problems. Note that $\lambda_{i+r} - \lambda_{i+r+1} < \epsilon$ for a small
$\epsilon$ is a sufficient condition for (5.3) to be close to 1. Conversely, neither $\lambda_{i+r}/\lambda_i$ nor
$\lambda_{i+r+1}/\lambda_i$ need be small to ensure (5.3) be close to 1.

5.4 Greedy Adaptively-blocked Lanczos

The block Lanczos method has notable advantages over the single-vector Lanczos algorithm; the
advantage most relevant to this effort is the more rapid convergence of eigenpairs that are
well-separated from the remainder of the spectrum. If the convergence is sufficiently fast, the block
method will produce an approximation with smaller error than the single-vector method. The block
method will also require fewer iterations to produce a minimal Krylov subspace of equal size; fewer
iterations imply reduced computational cost and reduced need for stabilization to compensate for
round-off error. Increasing block size is not without drawbacks; the computational cost of each
Lanczos iteration is proportional to $r^2$, where $r$ is the block size. An increase in block size must
therefore balance the gain in convergence against the increase in cost. To minimize compute time, we
would like to employ the random projection method to approximate only those eigenvalues that are
well-separated such that convergence may be driven by block size, and transition to single-vector
Lanczos iteration to generate the remaining basis vectors. By doing so, we should get both the good
convergence of the random projection method on the leading eigenvalues that are well-separated, and
the fast iteration of single-vector Lanczos. We may also apply the power refinement of the random
projection method to only the well-separated eigenvalues it will benefit most, and transition to
single-vector Lanczos on $A$ without refinement to further maximize approximation performance subject
to computational costs. Computational costs are managed both by reducing dense floating-point
operations due to block size, and also by only performing stationary power iteration on a block of
size $m < k$. Likewise, we can leverage the feature of Lanczos methods that constructs a low-rank
approximation as they iterate, rather than first constructing a basis and then projecting the input
matrix into it.

5.5 Random projections and orthogonal Lanczos iteration

Though ABLE has mechanisms to adaptively grow (or deflate) its block size at each iteration, we have
observed that it is not immediately useful for producing minimal Krylov subspaces that effect good
low-rank matrix approximations. Random projections offer good eigenvalue approximation properties
with excellent stability, but are expensive. By way of comparison, Lanczos iteration offers a lower
computational complexity than the direct eigenvalue approximation method, but with an unavoidable
loss of stability with continued iteration. All of these observations suggest a hybrid approach
that combines the stability and good eigenvalue approximation of the random projection method in
Algorithm 3 and the better computational complexity of classic Lanczos iteration in Algorithm 2.

One of the simplest hybrid approaches would be to directly combine Algorithm 3 and Algorithm 2
such that they operate on orthogonal subspaces. Thus we may compute one low-rank approximation
$A^{(m)} \approx A$ and then compute another approximation of the remainder $A - A^{(m)}$. The second algorithm
would benefit from the reduced spectral norm $\|A - A^{(m)}\|$. This reduction may be substantial when $A^{(m)}$
approximates the leading eigenpairs of $A$ well.
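
The reduction in spectral norm can be seen on a small example. The following Python sketch is illustrative only: it assumes NumPy, builds a made-up symmetric matrix with a geometrically decaying spectrum, and uses a simple random sample block with a little power refinement as a stand-in for Algorithm 3 (the actual Algorithm 3 is not reproduced here). All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 60, 5

# Hypothetical symmetric positive semidefinite test matrix with eigenvalues 1, 1/2, 1/4, ...
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
lams = 2.0 ** -np.arange(n)
A = (U * lams) @ U.T

# Stand-in for the random projection phase: random sample block with power refinement.
X0 = rng.standard_normal((n, m))
Q, _ = np.linalg.qr(A @ (A @ (A @ X0)))

# Rank-m approximation A_m = Q (Q^T A Q) Q^T and the remainder left for the second method.
A_m = Q @ (Q.T @ A @ Q) @ Q.T
remainder = A - A_m

norm_A = np.linalg.norm(A, 2)
norm_remainder = np.linalg.norm(remainder, 2)
print(f"||A|| = {norm_A:.4f}, ||A - A_m|| = {norm_remainder:.4f}")
```

Because the leading eigenpairs are captured by the first phase, the second method operates on a matrix whose spectral norm is much smaller than that of $A$.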

The resulting hybrid algorithm is presented in Algorithm 7. The algorithm first applies
Algorithm 3 to approximate the leading $m$ eigenvalues of $A$, and then proceeds with deflated single-vector
Lanczos iteration as described in Algorithm 8 to obtain maximum speed benefits. Needless to say,
the switch-over parameter $m$ determines the cost of the initial random projection phase of the
algorithm and the stability of the subsequent Lanczos iterations.

Algorithm 7 Random projections with orthogonal Lanczos iteration
Require: input matrix $A$, switchover parameter $m$, subspace sequence length $k$
1: generate random sample matrix $X_0$ with $m + 1$ columns.
2: obtain $V, \Theta$ from Algorithm 3 with $A$ and $X_0$.
3: obtain $Q, T, Z$ from $k - m$ iterations of Algorithm 8 with $A$, $X = [v_1\ v_2\ \dots\ v_m]$, and start vector
   $v_{m+1}$.
4: return $[V\ Q]$, $\begin{bmatrix} \Theta & Z \\ Z^T & T \end{bmatrix}$

Algorithm 8 Deflated single-vector Lanczos
Require: input matrix $A$, subspace basis $X$ with $j$ columns, start vector $q_1$, sequence length $k$
1: $z_1 \leftarrow X^T q_1$
2: $q_1 \leftarrow q_1 - X z_1$
3: $w \leftarrow A q_1$
4: $\alpha_1 \leftarrow q_1^T w$
5: $w \leftarrow w - q_1 \alpha_1$
6: $\beta_1 \leftarrow \|w\|$
7: $q_2 \leftarrow w / \beta_1$
8: for $i = 2 \to k$ do
9: $z_i \leftarrow X^T q_i$
10: $q_i \leftarrow q_i - X z_i$
11: $w \leftarrow A q_i$
12: $w \leftarrow w - q_{i-1} \beta_{i-1}$
13: $\alpha_i \leftarrow q_i^T w$
14: $w \leftarrow w - q_i \alpha_i$
15: $\beta_i \leftarrow \|w\|$
16: $q_{i+1} \leftarrow w / \beta_i$
17: end for
18: return $[q_1, \dots, q_k]$, $\begin{bmatrix} \alpha_1 & \beta_2 & & \\ \beta_2 & \alpha_2 & \beta_3 & \\ & \beta_3 & \alpha_3 & \beta_4 \\ & & \ddots & \ddots \end{bmatrix}$

In terms of compute time, Algorithm 7 has better computational complexity than the random
projection method, but not as good as single-vector Lanczos. To generate a $k$-dimensional subspace and
approximation, single-vector Lanczos has complexity $kO(n)$, while the random projection method has
complexity $O(nk^2)$. Algorithm 7 uses the random projection method, but to generate an
$m$-dimensional space; the complexity of that operation is $O(nm^2)$. The successive $k - m$ deflated Lanczos
iterations jointly require $4knm - 2km + 2kn_{nnz} + 5kn$ FLOPS. Thus, the speedup is determined by how
$m$ compares to $k$. For example, in the extreme case $m = 1$, the cost is the same as single-vector
Lanczos iteration. In general, Algorithm 7 costs
4m^  -I-  2m^  -i-  4mkn  —  2mk  +  2mnnnz  ^  6mn  -i-  2kn-nnz  +  5kn  FLOPS. 
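
The deflated iteration of Algorithm 8 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes NumPy, a dense symmetric $A$, and an orthonormal deflation basis $X$; it deflates the vector $w$ before normalising (rather than the next Lanczos vector after normalising), stores $Z = X^T A Q$ as the coupling block, and performs no reorthogonalization. The function name `deflated_lanczos` and the test matrix are hypothetical.

```python
import numpy as np

def deflated_lanczos(A, X, q_start, k):
    """Single-vector Lanczos on symmetric A, kept orthogonal to the columns of X."""
    n = A.shape[0]
    Q = np.zeros((n, k))
    Z = np.zeros((X.shape[1], k))
    alpha = np.zeros(k)
    beta = np.zeros(k)

    # Deflate and normalise the start vector.
    q = q_start - X @ (X.T @ q_start)
    q = q / np.linalg.norm(q)
    q_prev, beta_prev = np.zeros(n), 0.0

    for i in range(k):
        Q[:, i] = q
        w = A @ q
        Z[:, i] = X.T @ w           # coupling between X and the current Lanczos vector
        w = w - X @ Z[:, i]         # deflate against X (one Gram-Schmidt step)
        w = w - beta_prev * q_prev  # three-term recurrence
        alpha[i] = q @ w
        w = w - alpha[i] * q
        beta[i] = np.linalg.norm(w)
        if beta[i] == 0.0:          # invariant subspace reached; stop early
            Q, Z = Q[:, :i + 1], Z[:, :i + 1]
            alpha, beta = alpha[:i + 1], beta[:i + 1]
            break
        q_prev, beta_prev = q, beta[i]
        q = w / beta[i]

    T = np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)
    return Q, T, Z

# Hypothetical usage on a small random symmetric matrix.
rng = np.random.default_rng(1)
n, m, k = 30, 3, 6
M = rng.standard_normal((n, n))
A = (M + M.T) / 2.0
X, _ = np.linalg.qr(rng.standard_normal((n, m)))
Q, T, Z = deflated_lanczos(A, X, rng.standard_normal(n), k)
print(np.linalg.norm(X.T @ Q), np.linalg.norm(Q.T @ A @ Q - T))
```

With these choices, projecting $A$ into the combined basis $[X\ Q]$ gives the block matrix with $X^T A X$, $Z$, $Z^T$ and $T$ as its blocks, which mirrors the structure returned by Algorithm 7. The per-iteration deflation costs two products with $X$, which is why a smaller deflation subspace makes each iteration cheaper.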

Though Algorithm 7 does effectively combine the random projection and classic Lanczos methods,
the key limitation is that the value of the switchover parameter $m$ is left to be determined by the user.
To minimize compute costs, one would like to choose $m$ to be as small as possible, but big enough so
that the well-separated and large eigenvalues are all approximated by the random projection method.
Intuitively, we may say that we would like the random projection method, which has good eigenvalue
approximation properties, to approximate those eigenpairs which matter most to approximating $A$
well. In actuality, the choice of $m$ is more complicated than that, and there is no clear choice of
$m$ based on prior knowledge of $A$. To better understand how to choose $m$ well, we consider
the relationship between convergence of eigenvalues and block size. With that information, we may
choose a value for $m$ that represents the best compromise between compute time and approximation
quality.

5.5.1  Choosing  m  automatically 

Our criterion for adaptive inflation of the Krylov subspace leverages our assumed spectral
distribution. We suppose that leading eigenvalue gaps are large in proportion to their eigenvalues initially
and rapidly become proportionally smaller approaching the interior. This is typical of eigenvalue
distributions encountered in generic dimension reduction tasks. Figure 5.2 shows a theoretical but
representative eigenvalue distribution. Suppose we are to generate a 40-dimensional minimal Krylov
subspace. As noted previously, extending the block size beyond 8 will not lead to improved
convergence, especially for the leading 8 eigenvalues. However, if a block size of 8 is used, the bounds from
Halko et al. suggest that the approximations obtained by the random projection algorithm will be
good. Subsequent Lanczos iterations using a block size of 8 operating on $A - A^{(8)}$ as defined
previously will not yield much of an advantage over single-vector Lanczos due to the block size, as the gaps
between $t'$ and $t$ are proportionally small. Ideally, we would like to perform the random projection
algorithm with a block size of 8, and then deflate the block size to 1 for the remaining iterations.
In order to do this automatically, we require a method to detect the point at which eigenvalue gaps
become proportionally small, assuming an eigenvalue distribution with a shape similar to the
distribution in Figure 5.2.

Figure 5.2: Typical eigenvalue distribution. Extension of block size beyond $t'$ does not impart many advantages
for the leading 8 eigenvalues.

Error bounds for eigenpairs in the Krylov subspace allow one to reason about
local eigengaps, albeit indirectly and in the presence of “noise” from convergence due to the random
start block. In Figure 5.2, the ideal block size is at the point at which local gaps become small; this is
also the point at which additional dimensions will result in dramatically decreasing convergence of
the principal eigenvalue in the random projection method. More formally, we recall that the bounds
for the principal eigenvalue $\lambda_1$ are given by

\[ 0 \le \lambda_1 - \lambda_1^{(r)} \le (\lambda_1 - \lambda_{\inf})\,\frac{\tan^2\theta}{T_{i-1}^2(\hat\gamma_1)} \tag{5.6} \]

with

\[ \hat\gamma_1 = 1 + 2\,\frac{\lambda_1 - \lambda_{r+1}}{\lambda_{r+1} - \lambda_{\inf}}. \tag{5.7} \]

The effect of these gaps on the value of $\hat\gamma_{1,r}/\hat\gamma_{1,r+1}$ from the above eigenvalue distribution is shown
in Figure 5.3. This value represents the change in worst-case bounds on eigenvalue approximation
error due to block size increases. A large value of $\hat\gamma_{1,r}/\hat\gamma_{1,r+1}$ indicates that block size increases
produce only a small additional convergence of the eigenvalue $\lambda_1$. The actual value of the approximation
of $\lambda_1$ is shown in Figure 5.4. The actual convergence of this approximation follows the trend of the
worst-case bounds predicted by $\hat\gamma_{1,r}/\hat\gamma_{1,r+1}$; when local gaps $\lambda_1 - \lambda_{r+1}$ fail to produce an increase in
$\hat\gamma_{1,r}$, then the convergence of the eigenvalue estimate stagnates. This suggests that block size
increases over some point (3 in this example) will not substantially add to the convergence of the
principal eigenvalue. Therefore, compute cost benefits may be realized by using a smaller block size
than $k$ to produce $k > 3$ dimensions.

Figure 5.3: Values of $\hat\gamma_{1,r}/\hat\gamma_{1,r+1}$ for eigenvalue distribution given in Figure 5.2.

The  above  discussion  lacks  some  rigor;  however,  it  is  intended  to  motivate  the  following  formal 
discussion.  We  have  already  established  with  Theorem  10  that  eigenvalue  distributions  that  have 
properties  similar  to  those  shown  in  Figure  5.2  —  large  leading  local  gaps  that  decrease  rapidly 
towards  the  interior  of  the  spectrum  —  will  encounter  eigenvalue  stagnation  if  only  block  size  growth 
is  used  to  drive  convergence.  Our  approach  will  then  track  the  convergence  of  eigenvalues  with  block 
size  inflation,  using  the  following  proposition. 

Proposition 7. Let $\lambda_1 > \lambda_2 > \dots > \lambda_{\inf}$ be real, positive eigenvalues and $r$ and $i$ be natural numbers.
We have $\lambda_1 - \lambda_{r+1} \ge \lambda_i - \lambda_{r+1}$ for any $i$ and

\[ \hat\gamma_1 = 1 + 2\,\frac{\lambda_1 - \lambda_{r+1}}{\lambda_{r+1} - \lambda_{\inf}} \ge 1 + 2\,\frac{\lambda_i - \lambda_{r+1}}{\lambda_{r+1} - \lambda_{\inf}} = \hat\gamma_i. \tag{5.8} \]

Therefore, the bounds on $\lambda_1$ are proportionally no looser than the bounds for any other
eigenvalue, and stagnation of the leading eigenvalue implies stagnation of the interior eigenvalues. This
allows us to track only the convergence of $\lambda_1$.

Figure 5.4: Values of the approximation of $\lambda_1$ from random projections at various dimensions for eigenvalue
distribution given in Figure 5.2.

As the error for approximation of the principal eigenvalue $\lambda_1$ is smaller than the errors for
more interior eigenvalues, it is sufficient to simply track the convergence of $\lambda_1$ and monitor for
stagnation. This allows for adaptive generation of the initial start block in Algorithm 7; the initial start
block on line 1 may be generated iteratively. As each dimension is added, the principal eigenvalue
$\theta^{(j)}$ of the approximation may be compared to those from the last few previous iterations. Once the
increase in the magnitude of $\theta^{(j)}$ stagnates, then we assume that $\hat\gamma_1$ is not increasing much. An
iterative inflation procedure is described in Algorithm 9. This algorithm iteratively builds a basis for
$\mathrm{span}\{AX_0\}$ for a random $X_0$, while tracking the convergence of the principal eigenvalue. Each
iteration generates a new random vector, multiplies it against $A$, and orthogonalizes against the previous
subspace via the Gram-Schmidt process on lines 6 and 7. Lines 12 and 13 represent measuring the
cosine of the angle between the vector $e$ whose entries are all 1 and the vector of the most recent
approximations of $\lambda_1$ within the window $w$. Once this angle becomes sufficiently small, inflation
terminates. This approach may be characterized as greedy, as it will stop inflation at the first
stagnation. Non-monotonicity of local gaps may result in temporary stagnation followed by a resumption
in convergence. Nevertheless, this greedy behavior is advantageous when used to automatically choose
the block size $m$ for Algorithm 7; a smaller $m$ results in less cost in the deflated Lanczos iteration.

Algorithm 9 Greedy adaptive block inflation
Require: input matrix $A$, stagnation tolerance $\rho$, stagnation window $w$
1: $j \leftarrow 1$
2: $\Delta \leftarrow 0$
3: while $\Delta < \rho$ do
4: generate random vector $x$
5: $q_j \leftarrow A x$
6: $q_j \leftarrow q_j - Q_{j-1} Q_{j-1}^T q_j$
7: $q_j \leftarrow q_j / \|q_j\|$
8: $A^{(j)} \leftarrow Q_j^T A Q_j$
9: spectral decompose $V \Theta V^T = A^{(j)}$
10: $\theta^{(j)} \leftarrow \theta_1$
11: if $j > w$ then
12: $a \leftarrow (\theta^{(j-w)})^2 + (\theta^{(j-w+1)})^2 + \dots + (\theta^{(j)})^2$
13: $\Delta \leftarrow (\theta^{(j-w)} + \theta^{(j-w+1)} + \dots + \theta^{(j)}) / (\sqrt{w}\sqrt{a})$
14: end if
15: $j \leftarrow j + 1$
16: end while
17: return $A^{(j)}, V$
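
The greedy inflation of Algorithm 9 can be sketched as follows. This is a hedged illustration, not the author's implementation: it assumes NumPy and a dense symmetric $A$, draws inflation vectors uniformly from $[-1, 1)$ (the interval is an assumption), applies Gram-Schmidt twice for numerical safety, measures stagnation as the cosine between the all-ones vector and the last $w$ principal Ritz values, and adds a hypothetical `max_dim` guard so that the loop always terminates. The function name `greedy_inflation` and the test spectrum are made up.

```python
import numpy as np

def greedy_inflation(A, rho, w, rng, max_dim=None):
    """Grow an orthonormal basis one random direction at a time until the
    principal Ritz value stagnates; return Ritz vectors and Ritz values."""
    n = A.shape[0]
    max_dim = n if max_dim is None else max_dim
    Q = np.zeros((n, 0))
    history = []                      # principal Ritz value at each dimension
    while True:
        q = A @ rng.uniform(-1.0, 1.0, n)
        for _ in range(2):            # Gram-Schmidt against the previous basis, twice
            q = q - Q @ (Q.T @ q)
        q = q / np.linalg.norm(q)
        Q = np.column_stack([Q, q])
        evals, evecs = np.linalg.eigh(Q.T @ A @ Q)
        history.append(evals[-1])
        if len(history) > w:
            window = np.array(history[-w:])
            # Cosine between the window of Ritz values and the all-ones vector.
            delta = window.sum() / (np.sqrt(w) * np.linalg.norm(window))
            if delta >= rho:
                break
        if Q.shape[1] >= max_dim:
            break
    order = np.argsort(evals)[::-1]   # leading Ritz pairs first
    return Q @ evecs[:, order], evals[order]

# Hypothetical spectrum: three large, well-separated eigenvalues, then a flat tail.
rng = np.random.default_rng(2)
n = 40
lams = np.concatenate(([100.0, 60.0, 30.0], np.linspace(1.0, 0.5, n - 3)))
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (U * lams) @ U.T
A = (A + A.T) / 2.0

V, theta = greedy_inflation(A, rho=1.0 - 1e-6, w=3, rng=rng)
print("block size m =", V.shape[1], "principal Ritz value =", theta[0])
```

Because the cosine is very close to 1 even for moderately varying Ritz values, the tolerance in this sketch is placed near 1; how the tolerance is scaled in practice is a design choice that this example does not settle.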

5.6  The  GrABL  algorithm 

Algorithm  9  provides  for  an  alternate  inflation  strategy  from  that  used  in  ABLE,  and  provides  a 
method  for  choosing  the  initial  block  size  m  in  Algorithm  7  automatically.  With  this  new  adaptive 
inflation  algorithm,  we  develop  an  adaptively  blocked  Lanczos  algorithm.  The  algorithm  uses  the 
adaptive  inflation  method  to  track  convergence  of  the  leading  eigenvalue  to  infer  the  size  of  the  local 
gaps.  Though  the  adaptive  inflation  method  in  Algorithm  9  is  greedy,  the  greediness  is  advantageous 
with  respect  to  compute  times;  the  cost  of  deflated  Lanczos  iteration  depends  on  the  dimension  of 
the  deflation  subspace.  If  the  deflation  subspace  has  m  dimensions,  then  each  Gram-Schmidt  step 
required  to  orthogonalize  a  new  search  direction  requires  nm  FLOPS. 

We also note that, assuming an eigenvalue distribution as in Figure 5.2, inflation will terminate
with a block size $m$ such that the first $m$ eigenvalues of $A$ are much larger and therefore constitute
a larger proportion of the Frobenius norm of $A$ than others. Therefore, it is beneficial to
approximate them as well as possible. Better approximations may be attained by incorporation of a power
refinement approach as in the direct eigenvalue method in Algorithm 3. Then deflated single-vector
Lanczos iteration generates the remaining basis vectors.

The algorithm requires a tolerance for inflation cutoff, a tolerance for deflation, and a tracking
parameter $w$ that determines the number of iterations over which $\theta^{(j)}$ must be nonincreasing before
inflation stops; these are exactly the parameters used by Algorithm 9. After line 1, GrABL has an
orthonormal matrix of Ritz vectors $V$ with $m$ columns, with $\mathrm{span}\{V\} = \mathrm{span}\{AX_0\}$ for a random
$m$-column matrix $X_0$. If GrABL is passed a power refinement parameter, then $V$ is replaced with a new
matrix of $m$ Ritz vectors. Then deflated single-vector Lanczos proceeds using the infimal Ritz vector
deflated by the preceding $m - 1$ Ritz vectors.

The optional refinement step at line 3 allows for further improvement of the initial subspace. This
also reflects the hybrid approach between the power-iteration eigenvalue approximation methods
in [36] and traditional Krylov subspace methods. It is this refinement factor that allows GrABL
to obtain an advantage over alternate methods such as single-vector Lanczos iteration; we have
observed that GrABL with a refinement factor of $p > 1$ produces better low-rank approximations
than ordinary Lanczos iteration, while still requiring fewer computational resources than the direct
eigenvalue approximation method in Algorithm 3.

Algorithm 10 Greedy-adaptive Lanczos algorithm with random projection inflation
Require: inflation cutoff tolerance $\rho$, convergence window $w$ and dimension $k$, optional refinement
   parameter $p$
1: obtain $A^{(m)}, V$ from Algorithm 9 run with $w = w$ and $\rho = \rho$, where $V$ has $m$ columns.
2: if $p > 1$ then
3: QR factorize $QR = A^{p-1} V$
4: $\Lambda \leftarrow V^T A V$
5: spectral decompose $\Lambda = V \Theta V^T$
6: end if
7: $X \leftarrow [v_1\ v_2\ \dots\ v_{m-1}]$
8: Perform $k - m + 1$ single-vector Lanczos iterations on $A$, deflated by $X$ using Algorithm 8, get
   $Q, T, Z$.
9: $Q = [X, Q]$
10: $T = \begin{bmatrix} \Lambda & Z \\ Z^T & T \end{bmatrix}$
11: return $Q, T$

Traditionally, deflation is performed to resolve breakdowns due to loss or near-loss of linear
independence in columns of $A^i X_0$ for some $i$. Those columns that have lost linear independence are
removed from the iteration block, and each successive iteration is orthogonalized against them to
maintain the Lanczos recurrence. For GrABL, deflation is used somewhat differently, and different
vectors may be used. We are deflating not against Lanczos vectors, but against Ritz vectors. When
deflation with Ritz vectors is used at iteration $i$, maintaining orthogonality of the Lanczos basis
requires orthogonalization against all $i - 1$ previous Lanczos blocks rather than only those that have
been deflated. When deflation is performed at the first iteration, the Ritz and Lanczos bases
coincide. We may simply discard the trailing $m - 1$ Ritz vectors, and proceed with single-vector iteration
with Ritz vector $m$ deflated against the leading $m - 1$ Ritz vectors at cost equal to deflation against
Lanczos vectors.

Deflation  may  result  in  some  improvement  in  the  stability  of  future  Lanczos  iterations.  We  have 
a  relationship  between  the  spectral  norm  of  the  input  matrix  A  and  the  loss  of  orthogonality  in  the 
Lanczos  basis  [74].  In  our  observations,  we  have  witnessed  some  improvements  in  stability  due  to 
deflation,  but  the  improvements  were  not  large  or  consistent  enough  to  warrant  the  abandonment  of 
reorthogonalization  measures. 

As noted previously, the random projection algorithm requires compute effort asymptotic with
$O(nk^2)$ to generate a $k$-dimensional near-minimal Krylov subspace when the size and density of
the input matrix are fixed. GrABL has a compute resource advantage over the random projection
algorithm as it minimizes the cost per iteration by minimizing the block size as early as possible.
Its computational cost is equal to that of Algorithm 7, but the parameter $m$ is minimized via greedy
termination of the block size inflation process. Moreover, like Algorithm 7, immediate deflation to
single-vector Lanczos iteration minimizes the cost of each iteration of Algorithm 8.

5.7 Conclusion

Random projections may be effective eigenvalue approximation methods which are also robust to
loss of orthogonality. Though complete convergence to eigenvectors may be slow in general, random
projections still produce good approximations to eigenvalues with only a few stationary iterations.
They may be used for finding low-rank matrix approximations when exact eigenvalues are not
required. Random projection methods share similarities with block Krylov subspace methods, which
have, in single-vector form, also been used for relaxed alternatives to the truncated SVD for low-rank
matrix approximation. Random projection methods do have two disadvantages with respect to
computational costs. First, the dense linear algebra operations require FLOPS that scale quadratically
with the dimension of the subspace. Second, the method requires $(2p + 1)k$ sparse matrix-vector
products to form a $k$-dimensional subspace; this is problematic when sparse matrix-vector products are
not inexpensive. By performing random projections with stationary iteration only on those $m$ eigenpairs
or singular triplets that contribute most to the Frobenius norm of the approximation, computational
costs due to both sparse matrix multiplication and dense matrix operations can be reduced.
Generating the remaining $k - m$ subspace dimensions with single-vector Lanczos iteration can further reduce
compute costs. In this case, only $2mp + k + 2m$ sparse matrix-vector multiplications are necessary,
and the cost for dense matrix operations is no longer asymptotically quadratic with subspace size $k$.
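
As a small worked example of the sparse matrix-vector product counts quoted above, the following Python sketch evaluates both expressions. The subspace size $k = 40$ and block size $m = 8$ echo the running example of Section 5.5.1, while the refinement parameter $p = 2$ is an assumption made only for illustration; the function names are hypothetical.

```python
def matvecs_random_projection(k, p):
    """Sparse matrix-vector products for the random projection method: (2p + 1) k."""
    return (2 * p + 1) * k

def matvecs_hybrid(k, m, p):
    """Sparse matrix-vector products for the hybrid approach: 2mp + k + 2m."""
    return 2 * m * p + k + 2 * m

k, m, p = 40, 8, 2  # p is an illustrative assumption
full = matvecs_random_projection(k, p)
hybrid = matvecs_hybrid(k, m, p)
print(full, hybrid)  # prints "200 88"
```

Under these assumed parameters, the hybrid approach needs fewer than half of the sparse products of the pure random projection method, and the gap widens as $p$ grows.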

Though  the  FLOP  cost  analysis  of  the  random  projection  is  somewhat  straightforward,  it  is  not 
trivial  to  determine  the  optimal  m  for  changeover  from  random  projections  to  single-vector  Lanczos 
iteration.  We  have  presented  a  method  that  chooses  m  greedily,  and  based  it  on  an  assumption  that 
eigenvalue  clustering  becomes  tight  in  a  somewhat  uniform  way.  Though  clustering  of  eigenvalues 
does  imply  that  improvements  in  eigenvalue  convergence  solely  due  to  increasing  block  size  will 
stagnate,  the  greedy  approach  may  prematurely  deflate  the  working  subspace.  Though  we  argue 
that  eager  deflation  is  beneficial  from  a  computational  cost  standpoint,  smaller  approximation  norms 
may  be  realized  by  prolonged  block  inflation.  It  would  be  useful  for  future  efforts  to  characterize 
the  convergence  of  eigenvalue  approximations  in  a  hybrid  subspace  so  that  worst-case  eigenvalue 
convergence in GrABL may be compared to other Krylov subspace methods.

Applications of GrABL


To illustrate the performance of GrABL, we present results from numerical experiments. These
experiments mirror those presented in Chapter 4, and are intended to enable comparison
between the short block Krylov subspace methods and GrABL. As in the experiments presented
previously in Chapter 4, the foci of the experiments may be broken into two categories:
experiments to demonstrate the norm and stability of low-rank matrix approximations generated with the
methods under consideration, and the actual performance of data analysis methods (for example,
LSI) when using low-rank approximations generated by the competing low-rank approximation
methods. The comparison of low-rank approximation error is relevant, as it clearly shows the
differences between the low-rank approximation methods. However, some data analysis methods that
use low-rank matrix approximations are not tightly coupled with the Frobenius or nuclear norm;
a strictly smaller low-rank approximation error does not necessarily imply improved performance.
For example, applying PCA ahead of a “downstream” classifier may actually improve performance
due to filtering. The projection acts as a filter, separating noise from signal. The performance of
the “downstream” consumer of the low-rank approximation is intended to directly show the impact
of the different low-rank matrix approximation techniques on methods consuming low-rank matrix
approximations.

To study the approximation error, we generated low-rank approximations of a selection of the
same data used in the experiments demonstrating the performance of short block Krylov subspaces:
the extended Yale eigenface data [45,49], an inverted graph Laplacian matrix representing the road
network of Colorado derived from the US Census Bureau’s TIGER-Line GIS data [1], and a
term-document matrix from the UCI machine learning repository’s bag-of-words collection [26]. In each
experiment, we generate a $k$-dimensional approximation to $A$ and compare the results of the random
projection algorithm in Algorithm 3, GrABL with p = 10“^ and window parameter $m = 3$, ABLE, and
single-vector Lanczos. Runs of Algorithm 7 with $m = k/2$ were also performed to provide a comparison
against the greedy block size deflation of GrABL. Start and inflation vectors were drawn from a
uniform distribution. Runs with no reorthogonalization and with full reorthogonalization were
performed with GrABL and Lanczos; only the full reorthogonalization results were used for comparing
low-rank approximations. The runs performed without any reorthogonalization are intended to compare
the stability of GrABL and single-vector Lanczos iteration. Again, we compare the approximation
error $\|A - A^{(k)}\|$ for each method, and analyze the magnitudes of the Ritz values to draw
conclusions regarding the differing performance of the algorithms. We also consider the inflation
and deflation histories of the adaptive algorithms. Due to adaptive inflation, ABLE may produce
excess Ritz pairs; in that case the Krylov subspace was truncated to the top $k$ Ritz pairs. The
experiments were performed on one of the same computers as the experiments in Chapter 4: a Mac Pro
with two quad-core Intel Xeon processors running at 3 GHz. The computer had 8 GB of RAM and was
running Mac OS X 10.6.8. The experiments were run using numpy linked against Apple’s vecLib-tuned
LAPACK and BLAS implementations.
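
The truncation to the top $k$ Ritz pairs and the error measurement described above can be sketched in numpy as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the function name `truncated_ritz_approximation`, the synthetic test matrix, and the choice of a 12-dimensional subspace truncated to $k = 8$ are all hypothetical, and the basis `Q` is assumed to be orthonormal (as it would be after full reorthogonalization).

```python
import numpy as np

def truncated_ritz_approximation(A, Q, k):
    """Rank-k approximation of a symmetric matrix A from the top-k Ritz pairs in span(Q).

    Q is assumed to have orthonormal columns (for example, a Lanczos basis
    computed with full reorthogonalization).
    """
    T = Q.T @ A @ Q                       # Rayleigh quotient matrix on the subspace
    theta, S = np.linalg.eigh(T)          # Ritz values in ascending order
    top = np.argsort(theta)[::-1][:k]     # indices of the k largest Ritz values
    theta_k = theta[top]
    Y = Q @ S[:, top]                     # corresponding Ritz vectors
    A_k = (Y * theta_k) @ Y.T             # sum_i theta_i * y_i * y_i^T
    return theta_k, Y, A_k

# Synthetic Gram matrix and a 12-dimensional subspace, truncated to k = 8.
rng = np.random.default_rng(0)
B = rng.uniform(size=(60, 40))
A = B.T @ B
Q, _ = np.linalg.qr(A @ rng.uniform(size=(40, 12)))
theta_k, Y, A_k = truncated_ritz_approximation(A, Q, k=8)

error = np.linalg.norm(A - A_k, "fro")
eigenvalues = np.linalg.eigvalsh(A)[::-1]
optimal_error = np.sqrt(np.sum(eigenvalues[8:] ** 2))   # error of the exact rank-8 truncation
print(error, optimal_error)
```

The exact rank-$k$ truncation gives a lower bound on the Frobenius error, so the gap between `error` and `optimal_error` is one way to compare subspaces of equal dimension.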

To study the results in applications that use low-rank matrix approximations generated by the
methods considered herein, we used the NPL information retrieval problem [71] and the visualization
of the Colorado road network Laplacian in the commute-time embedding [70].

6.1  Experiments  with  the  Yale  face  data 

Recalling the eigenfaces experiments in Section 4.1, the eigenface method [80] is an application of
PCA to the facial recognition problem. Given a set of training images, PCA is used to produce a small
subspace that contains the features needed for face classification. As in the experiments evaluating
the “shrink-and-iterate” and hybrid methods, we flattened each image into a vector $a_i$, stacked them
into a matrix $A' = [a_1 \; a_2 \; \ldots]$, and formed $A$ by subtracting the arithmetic mean of all
images from each column of $A'$. This corresponds to a centered data matrix as would be used for
principal components analysis. We then approximated the Gram (or covariance) matrix $A^T A$.
Eigenvalues of $A^T A$ were presented in a log-scale plot in Figure 5.1. The eigenvalue gaps and values
for $\gamma_{1,r}/\gamma_{1,r+1}$ are shown in Figure 6.1.

Figure 6.1: Eigenvalue gaps for $A^T A$ for the Yale face data (left), and $\gamma_{1,r}/\gamma_{1,r+1}$ (right).

Figure 6.2: Values for for the Yale face data.
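
The preprocessing described above (flatten, stack, center, form the Gram matrix, inspect eigenvalue gaps) can be sketched as follows. The random array standing in for the face images and all variable names are hypothetical; only the sequence of steps is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
images = rng.uniform(size=(10, 6, 5))            # 10 hypothetical 6x5 "images"

# Flatten each image into a column a_i and stack the columns into A'.
A_prime = images.reshape(len(images), -1).T      # shape (30, 10)

# Subtract the arithmetic mean of all images from every column.
mean_image = A_prime.mean(axis=1, keepdims=True)
A = A_prime - mean_image

# Gram matrix, its eigenvalues in descending order, and consecutive gaps.
G = A.T @ A
eigvals = np.linalg.eigvalsh(G)[::-1]
gaps = eigvals[:-1] - eigvals[1:]
print(eigvals[:4], gaps[:3])
```

Because the columns of the centered matrix sum to zero, the Gram matrix always has a zero eigenvalue whose eigenvector is the constant vector.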

The leading four eigenvalues of $A^T A$ are well-separated and the leading two are separated from
the rest by nearly an order of magnitude. When run with p = 10“^ and deflation window $m = 3$,
GrABL chose a block size of 4. Figure 6.2 shows the convergence of the values from Algorithm 3, which
follows the decreasing values of $\gamma_{1,r}/\gamma_{1,r+1}$. Therefore, for reduction to 4 or fewer
dimensions, GrABL is equivalent to the random projection algorithm. Beyond 4 dimensions, GrABL
produces approximations with Frobenius norm close to the random projection method. Figure 6.3 shows
the Frobenius norm of the low-rank approximation for various dimensions $k$. Using a refinement
factor of 3 improves approximation quality further. Though the block size chosen by GrABL was smaller
than $k/2$ for large dimensions, Algorithm 7 with $m = k/2$ did not produce substantially smaller
approximation errors. Due to the eigenvalue distribution, diminishing returns are rapidly
encountered, even when exact eigenpairs are used for projection. Though this might appear to suggest
that only the first 5 eigenvectors are needed for the dimension reduction, the resulting face images
are substantially improved by adding successive dimensions to the projection. We observe that GrABL
exhibits low-rank approximation error comparable to the random projection method for all dimensions,
and that single-vector Lanczos lags behind both GrABL and the random projection method for small
dimensions before achieving parity for higher dimensions.

Figure 6.3: Low-rank approximation errors of Yale data; GrABL v. random projections (top left); GrABL v.
single-vector and Algorithm 7 with $m = k/2$ (top right), ABLE v. random projections (bottom left) and GrABL
v. random projections with refinements (bottom right).

Figure 6.4: Ritz values from minimal 12-dimensional subspaces for GrABL, Lanczos, ABLE and random
projections.

To further show sources of the error lag between single-vector Lanczos and GrABL/random
projection for small dimensions, we show the Ritz values used for the 12-dimensional approximation in
Figure 6.4. All algorithms approximate the leading eigenvalues of $A^T A$ comparably well; however,
single-vector Lanczos has substantial error in its approximation of the tenth eigenvalue. GrABL
approximates this eigenvalue better. The poorer approximation of the trailing eigenvalues is a trend
evident in Lanczos approximations over all dimensions. ABLE approximates the 5 leading eigenvalues
well, but has significant gaps for the trailing ones.

ABLE was run with $\eta = 15(\lambda_1 - \lambda_2)$ in order to include the first 2 eigenvalues. This
value of $\eta$ was chosen to produce a block size of 4, equivalent to the block size found by GrABL.
ABLE produces the largest approximation error of all algorithms. This performance may be attributed
to the inflation strategy of ABLE; ABLE simply uses random vectors to inflate the Krylov subspace.
Random vectors will initially have small cosines with the leading eigenvectors of $A$. Table 6.1 shows
the block size history of ABLE and GrABL for the first 5 iterations; block size does not change after
the fourth iteration for either algorithm. There is a noticeable lag in approximation error for ABLE
at 4 dimensions, which is when it inflates the Krylov subspace. ABLE is not intended to be a minimal
Krylov subspace method, so this feature of the algorithm is ordinarily not an issue. For minimal
Krylov subspaces, a different inflation strategy or different inflation vectors may produce better
results.

Iteration    1    2    3    4    5
ABLE         1    2    3    3    3
GrABL        4    1    1    1    1

Table 6.1: Block size history of ABLE and GrABL run on the Yale data for the first 5 iterations.
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Figure  6.5:  Maximum  vector  cosine  between  Lanczos  basis  vectors  for  GrABL  and  single-vector  Lanczos  when 
run  without  reorthogonalization. 

The random projection method produces good approximations when operating on $A^p$; it closes
much of the gap between the random projection method on $A$ and the spectral approximation. With
refinement, GrABL produces near-parity with the random projection approximation operating on $A^p$
for smaller dimensions. GrABL also performs far fewer passes over the data, which is significant
when matrix multiplications are not trivial. This better approximation both decreases the error of
the approximation and decreases loss of orthogonality in GrABL.
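
To make the idea of random projections operating on a power of the matrix concrete, the following sketch uses a generic randomized subspace iteration for a symmetric matrix. It is offered as a stand-in under stated assumptions and is not a transcription of Algorithm 3: the function names, the uniform start block, the QR step after every product, and the synthetic spectrum are all choices made for this illustration.

```python
import numpy as np

def random_projection_basis(A, k, p=1, rng=None):
    """Orthonormal basis for the range of A**p applied to a random n-by-k block."""
    rng = np.random.default_rng() if rng is None else rng
    Q, _ = np.linalg.qr(rng.uniform(size=(A.shape[0], k)))
    for _ in range(p):
        # Re-orthonormalize after each product so the columns stay well conditioned.
        Q, _ = np.linalg.qr(A @ Q)
    return Q

def projection_error(A, Q):
    """Frobenius norm of A minus its projection onto span(Q)."""
    return np.linalg.norm(A - Q @ (Q.T @ A), "fro")

# Synthetic symmetric matrix with a geometrically decaying spectrum.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((50, 50)))
lam = 0.5 ** np.arange(50)
A = (U * lam) @ U.T

# Same random start block for both runs, so only the power p differs.
err_p1 = projection_error(A, random_projection_basis(A, 5, p=1, rng=np.random.default_rng(1)))
err_p3 = projection_error(A, random_projection_basis(A, 5, p=3, rng=np.random.default_rng(1)))
optimal = np.sqrt(np.sum(lam[5:] ** 2))
print(err_p1, err_p3, optimal)
```

Each extra power costs one more pass over the matrix for the whole block, which is the cost that the text contrasts with GrABL's smaller block size.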

Loss of orthogonality is a difficulty for all Lanczos algorithms. We ran both single-vector Lanczos
and GrABL without any reorthogonalization to study the loss of orthogonality. Figure 6.5 shows the
maximum vector cosine between Lanczos basis vectors up to 20 dimensions. GrABL demonstrates
slower loss of orthogonality, but still loses orthogonality eventually. Power iteration refinement
provided a marginal improvement in stability. ABLE has similar stability to single-vector Lanczos
initially, but loses orthogonality after the first inflation, likely due to convergence of the
principal eigenpair.

Figure 6.6: Leading 100 eigenvalues of the inverted graph Laplacian matrix for the Colorado road network
(left) and eigenvalue gaps (right).

Figure 6.7: Values of (left) and A*j*”** (right) for the inverted Colorado road network Laplacian.

6.2 Experiments with the Colorado road network Laplacian

We also performed experiments on the road network graph of the state of Colorado; motivations
for the study of graph Laplacian problems are detailed in Section 4.3. The road network was obtained
from the US Census Bureau’s TIGER-Line GIS data [1]. We used edge weights representing the distance
between vertices and replaced each degree-2 vertex with a single edge. We then extracted the largest
connected component. Again, the resulting matrix is 345,025 × 345,025. The leading 100 eigenvalues
of the inverted Laplacian matrix are shown in Figure 6.6.

The leading eigenvalues of $L^+$ are well-separated, but the rate of decay in gaps is not as sharp
as for the Yale face data. The slower rate of decay of eigenvalues in the Laplacian influences the
convergence of eigenvalues in Krylov subspaces, and the relative merits of GrABL, the random
projection algorithm, and single-vector Lanczos. The low-rank approximations of $L^+$ are shown in
Figure 6.8. Though these approximations also show diminishing returns with added dimensions, the
more slowly decaying gaps lead to a larger initial block size. With p = 10“^ and $m = 3$, GrABL chose
a block size of 8. ABLE was run with $\eta = 1.5(\lambda_1 - \lambda_2)$, again to arrive at a block size
of 8, as was found by GrABL. Note that Algorithm 7 did not produce significantly smaller errors than
GrABL at any dimension, even though it used $m = k/2$.

Figure 6.8: Low-rank approximation errors of Colorado graph Laplacian; GrABL v. random projections (top
left); GrABL v. single-vector (top right), ABLE v. random projections (bottom left) and GrABL v. random
projections with refinements (bottom right).

The overall trends between GrABL, the random projection method and single-vector Lanczos are
similar to the trends observed in the Yale face data experiments; GrABL produces approximations
with similar norm to the random projection method, while single-vector Lanczos lags initially but
obtains equivalent performance at higher dimensions. The difference in approximation error between
the spectral approximation and the random projection approximation is larger, but is closed by
random projections operating on $(L^+)^p$. Again, GrABL with $p = 3$ performs nearly as well as the
random projection method operating on $(L^+)^p$, but the advantage of random projections on $(L^+)^p$
over GrABL is larger than in the Yale experiment. ABLE again produces the largest approximation
error. Table 6.2 shows the block size histories of ABLE and GrABL for the first 7 iterations. ABLE
only required 7 iterations, and produced 1 excess Lanczos vector. Note that if a subspace of
dimension between 50 and 80 were required, ABLE’s block size of 29 would produce a large number of
excess basis vectors that would be discarded.

Table 6.2: Block size history of ABLE and GrABL for the first 7 iterations on the Colorado road network.

Figure 6.9: Maximum vector cosine between Lanczos basis vectors for GrABL and single-vector Lanczos when
run without reorthogonalization.

Orthogonality  is  also  lost  rapidly  for  this  graph  Laplacian,  though  not  as  quickly  as  for  the  Yale 
data.  Figure  6.9  shows  the  maximum  vector  cosine  for  Lanczos  basis  vectors.  Adding  refinement  to 
GrABL  does  not  result  in  an  improvement  in  stability  in  this  case. 
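
The stability measure used in these plots, the maximum vector cosine between Lanczos basis vectors, can be sketched for plain single-vector Lanczos as follows. The implementation is a textbook three-term recurrence with no reorthogonalization, written for illustration only; the diagonal test matrix, the function names and the step count are assumptions rather than details from the experiments.

```python
import numpy as np

def lanczos_basis(A, v0, steps):
    """Single-vector Lanczos with no reorthogonalization; basis vectors are the columns."""
    n = A.shape[0]
    V = np.zeros((n, steps))
    v = v0 / np.linalg.norm(v0)
    v_prev = np.zeros(n)
    beta = 0.0
    for j in range(steps):
        V[:, j] = v
        w = A @ v - beta * v_prev        # three-term recurrence
        alpha = v @ w
        w = w - alpha * v
        beta = np.linalg.norm(w)
        if beta == 0.0:                  # exact breakdown: invariant subspace found
            return V[:, : j + 1]
        v_prev, v = v, w / beta
    return V

def max_cosine(V):
    """Largest absolute cosine between two distinct columns of V."""
    norms = np.linalg.norm(V, axis=0)
    C = np.abs(V.T @ V) / np.outer(norms, norms)
    np.fill_diagonal(C, 0.0)
    return C.max()

# Symmetric test matrix with a decaying spectrum and a uniform random start vector.
rng = np.random.default_rng(0)
A = np.diag(0.8 ** np.arange(200))
V = lanczos_basis(A, rng.uniform(size=200), steps=40)

# Maximum cosine among the first j+1 basis vectors, for growing subspace dimension.
history = [max_cosine(V[:, : j + 1]) for j in range(1, V.shape[1])]
print(history[0], history[-1])
```

Plotting `history` against the subspace dimension gives a curve of the kind described for Figures 6.5 and 6.9; a value near machine precision indicates an orthogonal basis, and a value near 1 indicates that orthogonality has been lost.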

The inverted graph Laplacian induces the commute-time embedding. For certain classes of
graphs, the commute-time embedding reveals interesting and meaningful structure. The Colorado
road network graph is one such graph for which the commute-time embedding is useful. The
commute-time embedding is defined in terms of the pseudoinverse $L^+$ of the graph Laplacian $L$ with
spectral decomposition

(6.1)  $U \Lambda U^T = L^+$.

Columns of $U$ give coordinates of vertices of the graph in the commute-time embedding. Good
approximations of eigenvalues and eigenvectors of $L^+$ are necessary for good approximations of the
commute-time embedding for visualization purposes. We therefore used modest power refinement
factors for the algorithms under consideration; for GrABL and random projections, we used a
refinement factor of $p = 2$. Figure 6.10 shows the grand tour of the first 20 dimensions of the
commute-time embedding of $L^+$ using eigenvectors, Figure 6.11 shows the grand tour of the
commute-time embedding of the GrABL-produced low-rank approximation, and Figure 6.12 shows the
commute-time embedding of the random projection-produced low-rank approximation. All 20 dimensions
of the commute-time embedding in Figure 6.10 exhibit definite structure. All of GrABL, static-block
Lanczos and random projections approximate the leading 10 dimensions without any noticeable visual
error. The remaining 10 dimensions are also approximated with some accuracy, though the
random projection method approximates the trailing 10 dimensions somewhat better than GrABL.
Nevertheless, GrABL is intended to be less costly than random projections with refinement.

Figure 6.10: Grand tour of first 20 dimensions of commute-time embedding of Colorado Road network graph.

Figure 6.11: Grand tour of first 20 dimensions of commute-time embedding of GrABL-produced low-rank
approximation of the Colorado Road network graph using a refinement factor of 2.

Figure 6.12: Grand tour of first 20 dimensions of commute-time embedding of random projections-produced
low-rank approximation of the Colorado Road network graph.
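
As a small illustration of the commute-time embedding, the sketch below computes coordinates for a toy path graph from the eigenpairs of the Laplacian pseudoinverse. It assumes one common convention, in which eigenvectors are scaled by the square roots of the eigenvalues of $L^+$ so that squared Euclidean distances equal effective resistances; the scaling used for the figures is not restated in this section, so that convention, the function name and the toy graph are assumptions.

```python
import numpy as np

def commute_time_embedding(W, k, tol=1e-10):
    """Leading-k commute-time coordinates (one row per vertex) of a connected weighted graph.

    W is a symmetric nonnegative weight matrix. Eigenvectors of the Laplacian
    pseudoinverse are scaled by the square roots of its eigenvalues (an
    assumed convention, chosen so squared distances are effective resistances).
    """
    L = np.diag(W.sum(axis=1)) - W
    mu, U = np.linalg.eigh(L)             # eigenvalues of L, ascending
    keep = mu > tol * mu.max()            # drop the null space (the constant vector)
    lam = 1.0 / mu[keep]                  # eigenvalues of the pseudoinverse, descending
    return U[:, keep][:, :k] * np.sqrt(lam[:k])

# Path graph on 5 vertices with unit edge weights.
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

X = commute_time_embedding(W, k=n - 1)
squared_distance = np.sum((X[0] - X[-1]) ** 2)
print(squared_distance)   # effective resistance between the two end vertices
```

With all $n - 1$ nonzero eigenpairs retained, the squared distance between the end vertices equals the effective resistance of four unit edges in series; truncating to fewer dimensions can only shrink this distance, which is why accurate leading eigenpairs matter for visualization.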

As this inverted graph Laplacian matrix is sparse, even in its factored form, the extra overhead
for refinement is smaller than for the dense Yale face data. We accounted for partial
reorthogonalization in single-vector Lanczos iteration, ABLE and GrABL, assuming that Ritz vectors
have converged at iteration i. The FLOP counts for generating a subspace of up to 64 dimensions are
shown in Figure 6.13. We present both the total FLOP counts and the FLOPs due to sparse and dense
operations. These approximations correspond to the norms of the approximations in the bottom-right
plot of Figure 6.8. Some overhead is evident for GrABL with refinements at low dimensions; however,
this overhead is eventually overwhelmed by the superior computational complexity of GrABL. GrABL
only has to perform random projections on a block size of 8. For larger dimensions, the advantage
of GrABL over random projections is evident. Note that the slope of the lines in the GrABL chart in
Figure 6.13 changes after deflation occurs at a block size of 8; this is due to the decreased cost
of each successive iteration from block size deflation. This may be contrasted with the constant
slope of the lines in the comparable FLOP charts in Figure 4.13.

Figure 6.13: FLOP counts for generating a low-rank approximation of $L^+$ with random projections (top left),
GrABL with $p = 3$ (top right) and ordinary single-vector Lanczos for various dimensions.

Figure 6.14: Leading 100 eigenvalues of $A^T A$ for the Enron email corpus (left) and eigenvalue gaps (right).

Table 6.3: Block size history of ABLE and GrABL for the first 11 iterations on the Enron corpus.


6.3  Experiments  with  the  Bag  of  Words  term-document  matrix 

The truncated SVD is used in information retrieval applications as part of LSI. As in the experiments
presented in Section 4.2, we used the Enron email corpus, which contained 39,861 documents and
28,099 unique terms. The algorithms under consideration were applied to the matrix $A^T A$. The
spectrum of $A^T A$ is shown in Figure 6.14. The matrix $A^T A$ has a flatter spectrum than either
the eigenface data or the inverted Colorado road network graph Laplacian. The eigenvalue gaps
influence  the  rate  of  convergence  of  eigenvalue  approximations.  The  trend  apparent  between  the 
Yale  face  data  and  Colorado  road  network  Laplacian  is  continued  in  this  experiment;  smaller  local 
gaps  lead  to  a  smaller  advantage  of  maximal  block  sizes.  Figure  6.15  shows  improvement  in  worst- 
case  bounds  due  to  block  size  inflation  and  the  actual  convergence  of 

The low-rank approximations are shown in Figure 6.16. The smaller local gaps imply a decreased
initial convergence advantage to block size. In contrast to the Yale face data experiment,
GrABL and the random projection method hold no particular advantage over single-vector Lanczos,
even at small dimensions. GrABL’s error advantage over single-vector Lanczos is only slight. ABLE
produces larger approximation errors than all other methods, as was the case in the preceding
experiments. Likewise, Algorithm 7 with $m = k/2$ did not produce smaller approximation errors than
GrABL, despite using a larger block size. Application of random projections to $(A^T A)^p$ improved
the performance of the random projection method substantially, as did the analogous refinements
applied to GrABL. ABLE was run with $\eta = 8(\lambda_1 - \lambda_2)$ to produce a block size history
most comparable to GrABL. The block size histories of ABLE and GrABL are shown in Table 6.3.

Figure 6.15: Values of $\gamma_{1,r}/\gamma_{1,r+1}$ (left) and A*-|*”*' (right) for the Enron email corpus.

Figure 6.16: Low-rank approximation errors of Enron document corpus; GrABL v. random projections (top
left); GrABL v. single-vector (top right), ABLE v. random projections (bottom left) and GrABL v. random
projections with refinements (bottom right).

Figure 6.17: Maximum vector cosine between Lanczos basis vectors for GrABL, ABLE and single-vector
Lanczos when run without reorthogonalization (left) and Ritz values from minimal 13-dimensional subspaces
for GrABL, Lanczos and ABLE (right).

The smaller local gaps lead to slower convergence of eigenvalues, but this also implies more
stable iteration. Figure 6.17 shows the loss of orthogonality over iterations for single-vector Lanczos,
ABLE and GrABL, along with eigenvalue estimates from the 13-dimensional subspace. Though its trailing
eigenvalue estimates have smaller magnitude than those of GrABL, ABLE does approximate the leading
eigenvalue slightly better than single-vector Lanczos at 10 dimensions; the residual is less than
1.1 × 10“^ compared to 6.4 × 10“^ for single-vector Lanczos. The better approximation of the leading
eigenvalue implies faster loss of orthogonality. GrABL with refinements also has somewhat faster
loss of orthogonality than single-vector Lanczos.

The Enron corpus problem is also sufficiently large to produce meaningful FLOP comparisons. As
in the experiments with the Colorado road network, partial reorthogonalization was used to maintain
orthogonality of the Lanczos basis. The FLOP counts for generating a low-rank approximation of the
term-document matrix with eigenvalue approximations are shown in Figure 6.18. The plot compares
the same norm case as is shown in the lower-right plot in Figure 6.16. The superior computational
complexity of GrABL is also evident in the scaling results of this experiment, and again, there is a
reduction in the slope of the line for FLOPs for generating a subspace of fixed dimension.

Figure 6.18: FLOP counts for generating low-rank approximations of the term-document Gram matrix $A^T A$
for the Enron email corpus. The random projection method (top left), GrABL with $p = 3$ (top right) and
single-vector Lanczos (bottom) are compared.


6.4  Information  retrieval  with  NPL  data 


To study the effects of applying SVD approximation to an information retrieval task, we consider
the NPL data, the same problem we studied in Section 4.2. The resulting term-document matrix has
7,491 columns and 11,429 rows. We applied term frequency and inverse document frequency scaling
to the entries in the term-document matrix, and generated low-rank approximations for $k = 2^i$
with $2 < i < 8$. Each query was applied against the low-rank approximation, and documents were
scored for relevance using vector angle cosines. We computed document-level averages for each query
at 100; the document-level average is simply the average percentage of relevant documents retrieved
in the first 100 query results. Figure 6.19 shows the query performance for the random projection,
GrABL approximations and LSI, and the Frobenius norms of the low-rank approximations for all
three methods.
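
The document-level average described above can be sketched as follows. This reading treats it as the fraction of the top-ranked results that are relevant, averaged over queries; the function name, the toy scores and the cutoff of 2 (used here in place of 100 to keep the example small) are hypothetical.

```python
import numpy as np

def document_level_average(scores, relevant, cutoff=100):
    """Fraction of the `cutoff` highest-scoring documents that are relevant."""
    ranked = np.argsort(scores)[::-1][:cutoff]        # document indices, best score first
    return float(np.mean(np.isin(ranked, list(relevant))))

# Two toy queries over four documents.
queries = [
    (np.array([0.9, 0.1, 0.8, 0.3]), {0, 3}),   # top two are documents 0 and 2
    (np.array([0.2, 0.7, 0.1, 0.6]), {1, 3}),   # top two are documents 1 and 3
]
per_query = [document_level_average(s, rel, cutoff=2) for s, rel in queries]
overall = float(np.mean(per_query))
print(per_query, overall)
```

Averaging the per-query values gives a single figure per method and per subspace dimension, which is the kind of quantity plotted against dimension in Figure 6.19.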

There are multiple methodologies for applying LSI-like dimension reduction to querying
problems. As in the previous experiments with the NPL data, we treat the subspace projection as a
query expander, and the cosine formula is

(6.2)  $\cos(a_i, q) = \dfrac{\langle a_i, F F^T q \rangle}{\|a_i\| \, \|F F^T q\|}$.
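
A minimal numpy sketch of query-expansion scoring follows. It assumes that the basis matrix in (6.2) has orthonormal columns spanning the approximating subspace; the name `F`, the small term-document matrix and the use of exact left singular vectors as the basis are assumptions made for this illustration.

```python
import numpy as np

def query_expansion_scores(A, F, q):
    """Cosine between every document column a_i of A and the expanded query F F^T q."""
    q_expanded = F @ (F.T @ q)                         # project the query onto span(F)
    numerators = A.T @ q_expanded                      # <a_i, F F^T q> for every document
    denominators = np.linalg.norm(A, axis=0) * np.linalg.norm(q_expanded)
    return numerators / denominators

# Toy term-document matrix: 4 terms (rows) by 3 documents (columns).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
F = U[:, :2]                                           # rank-2 basis used as query expander
q = np.array([1.0, 0.0, 1.0, 0.0])

scores = query_expansion_scores(A, F, q)
ranking = np.argsort(scores)[::-1]                     # documents ordered by relevance score
print(scores, ranking)
```

Only the query is projected; the documents are compared in their original term space, which is what distinguishes query expansion from scoring in the reduced space.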


Blom and Ruhe observed in [10] that treating Krylov subspace methods as query expanders rather than
dimension reducers produced better precision and recall. We witnessed the same behavior,
and show results evaluating performance using query expansion cosines. From Figure 6.19, it is
evident that random projections produce better query results than GrABL at lower dimensions. In
fact, random projections outperform LSI as well. This may be understood in terms of the different
subspaces being used as query expanders. Since extremal eigenvalues converge quickly in a Krylov
subspace, a minimal Krylov subspace will include both latent feature vectors that represent “signal”
from the leading part of the spectrum and latent feature vectors that represent “noise” from the
trailing part of the spectrum. Random projections are good at excluding trailing parts of the
spectrum. The inclusion of trailing spectral features may be remedied by use of a longer, non-minimal
Krylov subspace followed by Ritz vector truncation. Use of subspaces as query expanders implies
that there is an optimal subspace for query expansion which may be distinct from the norm-optimal
projection space for some $k$; simply projecting the query into the Frobenius norm-optimal rank-$k$
projection may not be query-expansion optimal for some $k$. Simply producing a better approximation
of the truncated SVD for rank $k$ may not translate into more relevant search results. Therefore, the
relationship between approximation error and query precision is not as clear as it is for applications
in which there is a close relationship between Frobenius norm and realized error, such as the
previous road network visualization problem.

Figure 6.19: Document-level averages for random projection and GrABL approximations of LSI (left) and
low-rank approximation norm errors (right) for the NPL collection.

6.5  Discussion 

For all three data sets, GrABL produces low-rank approximations that approach the accuracy of the
random projection method. For some dimensions, GrABL produces slightly better approximations
than the random projection method. The single-vector Lanczos method lags behind for small
dimensions, but does provide equivalent approximations for higher dimensions. Both single-vector
Lanczos and GrABL will require stabilization at some point in their iterations to prevent loss of
orthogonality of the Lanczos basis. The random projection method never lost any orthogonality in any
experiment. Though loss of orthogonality is an issue, selective reorthogonalization may be applied
to stabilize the Lanczos process [57] at a cost lower than that of full reorthogonalization.

The small approximation errors of the GrABL-produced approximations $A^{(k)}$ are attributable to
good eigenvalue approximations. GrABL excludes Ritz pairs that approximate trailing eigenpairs, as
does the random projection algorithm. This quality of GrABL may be credited to the start block. GrABL
incorporates the random projection algorithm in its initial inflation stage; therefore, it always
begins with a block of Ritz vectors that have small cosines with (that is, are nearly orthogonal to)
the trailing eigenvectors of $A$. This start block also excludes null eigenvectors of $A$. GrABL may
have better stability than single-vector Lanczos in some but not all cases. The cause for better
stability is not clear, but may be due to the shortening of the Krylov sequence induced by the start
block that is later deflated. Adding power iteration refinements to GrABL may or may not improve
stability; for the Yale face experiment, stability was improved with refinements, but for the
Colorado road graph Laplacian, stability was equivalent. For the Bag of Words data, adding
refinements worsened orthogonality loss for some dimensions. ABLE also showed varying stability
histories; it lost orthogonality quickly in the Yale experiment, was more stable at large dimensions
than all others for the Colorado experiment, but lost orthogonality quickly in the Enron corpus
experiment. Block size history is influential; if many of the dimensions ABLE produces are due to
inflation rather than iteration, then ABLE will have better stability but worse approximation.
Inflation may help to postpone loss of orthogonality, but cannot prevent convergence of the lead
eigenpair from corrupting the orthogonality of the Lanczos basis.

We have observed that the power parameter $p$ in GrABL is important for obtaining low-rank
approximation errors that are smaller than those of random projections. A key advantage of GrABL is
that it only performs extra sparse matrix-vector products due to power iteration for large leading
eigenvalues; thereby computational costs are reduced when sparse matrix-vector products are
expensive. Adding extra power iteration can improve the performance of GrABL substantially. When the
spectrum is favorable, as with the Yale face data, GrABL with refinements performs nearly as well as
the random projection algorithm, but has far lower computational complexity and requires fewer passes
through the data to perform extra power iteration. However, the experiments with the Colorado
road network graph and Enron email corpus show that when the eigengaps of the spectrum shrink
more slowly, the error difference between GrABL with refinements and the comparable random
projection method on $A^p$ grows faster. Nevertheless, adding refinements always produced better
approximations than single-vector Lanczos, random projections on $A$ or GrABL without refinements.

Choosing the value of $\eta$ may appear to have been done somewhat arbitrarily; however, it was
chosen to produce the most equivalent comparison against GrABL. The lead eigengap was scaled in an
attempt to include those eigenvalues that are to the left of point t' in Figure 5.2.
ABLE is sensitive to the choice of $\eta$; small variations in $\eta$ resulted in different inflation
behavior. For example, using the Yale face data and $\eta = 14$, ABLE did not inflate at all to
produce a 50-dimensional Krylov subspace, rendering it identical to single-vector Lanczos. Using
$\eta = 17$ caused ABLE to inflate as much as possible at every iteration but the last to produce a
50-dimensional subspace. Similar sensitivity was observed with the Colorado road network Laplacian
and Enron email corpus experiments.

In terms of compute time, GrABL enjoys a theoretical advantage over the random projection
method due to smaller block sizes. For example, to compute a 20-dimensional subspace on the Yale
face data, random projections would require costs proportional to 400n, where n is the size of the
covariance matrix. GrABL would require (16 + 16)n: 16 for the initial block size of 4 and a further 16
for the single-vector iterations. These theoretical advantages may not be completely realized, as block
methods may better exploit locality and parallelism. The unavoidable costs of stabilization are also
not considered. Nevertheless, one may expect GrABL to be meaningfully faster than the random
projection method.
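
The arithmetic behind these figures can be sketched as follows, assuming (as the 400n figure implies) that a block step of size b costs on the order of b²n operations. The function name and the cost model are illustrative assumptions, not part of the original analysis.

```python
def block_cost_units(block_sizes):
    """Sum of squared block sizes; multiply by n for a cost proportional to FLOPs.

    Assumed cost model: one block step of size b costs about b*b*n operations.
    """
    return sum(b * b for b in block_sizes)

# Random projection: a single block of size 20 spans the 20-dimensional subspace.
random_projection_units = block_cost_units([20])

# GrABL: one block of size 4, followed by 16 single-vector iterations (4 + 16 = 20 dimensions).
grabl_units = block_cost_units([4] + [1] * 16)

print(random_projection_units, grabl_units)
```

Under this model the ratio between the two counts depends only on the block-size history, which is why stabilization and memory-locality effects have to be considered separately.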

Several factors influence the choice of algorithm between single-vector Lanczos, GrABL and the
random projection method. We focus on the case in which sparse matrix-vector products are not
prohibitively expensive due to the sparsity of A. In these cases, GrABL may use refinement without cost
penalties, and obtain a non-trivial cost advantage over the random projection method, while
maintaining an approximation and possible stability advantage over single-vector Lanczos. Even
when sparse matrix-vector products are the dominant cost, GrABL enjoys an advantage over random
projections with the same power parameter due to the smaller number of sparse matrix-vector
products. The approximation advantage over single-vector Lanczos diminishes when many single-vector
Lanczos iterations are performed after deflation. The approximations obtained with GrABL using
refinement are closest to the random projection method when there are only a few iterations after
deflation. These two observations suggest that GrABL is well suited for reduction to intermediate
dimensions between t' and t in Figure 5.2.

6.6  Conclusion 

GrABL is intended to offer a middle ground between the random projection algorithm and minimal
Krylov subspace methods. The random projection algorithm produces good eigenvalue
approximations, but at a premium expense; smaller block sizes reduce the cost per iteration, but at the expense
of admitting Ritz pairs that approximate trailing eigenpairs into the subspace. The overarching
motivation behind GrABL is to increase block size until it is no longer advantageous, and deflate
aggressively thereafter to minimize the cost per iteration. GrABL is also intended to provide adaptive
block size capability similar to ABLE, but without the requirement for prior knowledge of the
spectrum of the input matrix. With appropriate deflation, GrABL has an overall lower FLOP count
than the random projection algorithm.

GrABL is specialized for low-rank matrix approximations and related generic dimension
reduction problems for which the optimal truncated eigenvalue decomposition is prohibitively expensive.
Problems with many tightly clustered leading eigenvalues or a flat spectrum are not appropriate for
GrABL. Yet many low-rank approximation and generic dimension reduction problems have
a small number of well-separated leading eigenvalues. In these cases, considerable advantages
are realizable from employing block Krylov subspaces rather than single-vector Krylov subspaces
to approximate the well-separated eigenvalues. Maximal block sizes used in the random projection
algorithm generally give satisfactory results, though their computational costs are high due to costs
quadratic in the block size. Use of a smaller block size would mitigate these costs, though the best
small block size may not be known a priori. When matrix-vector products are inexpensive, GrABL
can produce low-rank approximations that are competitive with the random projection method but
less expensive to generate, while still being of higher quality than a minimal Krylov subspace
generated with a single-vector Lanczos approach.
Chapter 7


Inner iteration for minimal Krylov subspaces


Minimal  Krylov  subspaces  offer  compute  time  advantages  over  eigenvector  spaces  for  a  wide  variety 
of  low-rank  approximation  and  dimension  reduction  tasks,  such  as  PCA,  POD  and  LSI.  All  of  these 
techniques use a truncated spectral decomposition of a positive-definite Gram matrix to effect a
reduction  in  dimensions.  Minimal  Krylov  subspace  projections  are  well-suited  as  an  alternative  to 
the  truncated  spectral  decomposition  for  these  problems,  as  the  eigenvalues  and  eigenvectors  used 
in  the  truncated  spectral  decomposition  are  also  those  eigenvalues  and  eigenvectors  that  converge 
first  in  a  Krylov  subspace.  These  eigenvalues  tend  to  be  sufficiently  separated  from  their  neighbors 
as  to  converge  with  satisfactory  speed  in  a  Krylov  subspace.  However,  satisfactory  convergence 
does  not  necessarily  imply  that  convergence  is  complete  in  any  minimal  Krylov  subspace,  or  that 
the  eigenvalue  approximations  given  by  a  Krylov  subspace  are  sufficiently  converged  as  to  produce 
a  satisfactory  low-rank  matrix  approximation.  Even  with  leading  eigenvalue  problems,  utilization 
of  some  acceleration  methods  may  produce  better  low-rank  approximation  results  than  a  simple 
minimal  Krylov  subspace.  Simple  minimal  Krylov  subspaces  may  be  attractive  in  some  cases  due  to 
their  storage  attributes;  only  k  Lanczos  basis  vectors  need  be  stored.  In  fact,  in  some  cases,  it  is  not 
possible  to  store  any  basis  larger  than  that  for  a  minimal  Krylov  subspace.  We  focus  on  these  cases, 
and  finding  ways  to  produce  better  Krylov  subspaces  when  storage  for  basis  vectors  is  constrained. 

As minimal Krylov subspaces represent the computationally least expensive Krylov subspace
possible, an acceleration method applied to them should also be proportionally inexpensive; it would
be useless to apply an expensive eigenvalue acceleration method to a minimal Krylov subspace and
thereby ruin its computational advantages. Shift-and-invert preconditioning methods do exist for
Krylov subspace methods for the eigenproblem; however, they are computational overkill for the
types of leading eigenvalue problems typically encountered in the aforementioned analysis
methods. Shift-and-invert requires factorization of the input matrix, which is expensive in general. The
straightforward alternative of simply extending the Krylov subspace likewise introduces additional
computational costs beyond the minimal Krylov subspace, both in time and in storage.
Indeed, there may be cases in which there is simply not sufficient memory to compute a
non-minimal Krylov subspace.

In contrast to shift-and-invert preconditioning, inner iteration may be comparatively
inexpensive. Though not as powerful as shift-and-invert preconditioning, inner power iteration is also not as
costly. Rather than requiring the inversion of the input matrix, either implicitly or explicitly, inner
power iteration only requires extra matrix-vector products. When the input matrix is sparse, these
matrix-vector products may be inexpensive compared to the dense linear algebra operations needed
to form the Krylov subspace and maintain orthogonality of its basis vectors. Inner power iteration
transforms the Krylov subspace from $\mathcal{K}_i(A, x_0)$ to $\mathcal{K}_i(A^p, x_0)$. Local gaps
$\lambda_j - \lambda_{j+1}$ between eigenvalues are expanded, resulting in faster convergence of leading
eigenvalues, especially when they are already well-separated. An alternate interpretation is that
$\mathcal{K}_{i/p}(A^p, x_0)$ is an approximation to the longer subspace $\mathcal{K}_i(A, x_0)$. Both
interpretations lend insight into the improved performance of
Krylov subspace projections with inner iteration for generation of minimal Krylov subspaces.

7.1  Background 

We consider applications of Krylov subspaces as approximations to a truncated eigenspace for
low-rank approximation problems. The low-rank approximation problem may range from PCA or
PCA-like applications to regularization of ill-posed problems. These classes of problems are optimally
solved with projection into the truncated eigenspace; the truncated k-dimensional eigenvector
projection not only produces the rank-k approximation with largest Frobenius norm, but also has filtering
properties. For example, PCA not only finds a Frobenius norm-optimal approximation, but
also appears to separate global structure from local deviations or noise. Thus, leading eigenvectors
used for a PCA approximation correspond to global components, and the trailing eigenvectors
excluded from a PCA approximation represent noisy, local features. Substitution of a minimal Krylov
subspace for a truncated eigenspace results in approximations with smaller Frobenius norm, and
introduction of noisy components into the approximation. Minimal Krylov subspaces will produce
eigenvalue approximations that have not converged to any particular eigenvalue. Some important
eigenvalues responsible for a large proportion of the Frobenius norm may not be completely
converged in a minimal Krylov subspace, even when the important eigenvalues are well-separated. The
most straightforward illustration of the problems caused by non-convergence of eigenvalues in a
spectral decomposition is low-rank matrix approximation for regularization of a positive definite
linear system. In this application of a truncated spectral decomposition, the linear system $Ax = b$ is
replaced by a rank-k approximation $A^{(k)}x = b$, in which all eigenvalues of A less than $\epsilon$ are set to 0.
The goal of regularization is to limit the norm of the solution $\|x\|$ when large-normed solutions
represent physically unlikely solutions. This matrix regularization problem is solved with the
truncated spectral decomposition $A^{(k)} = U_k \Lambda_k U_k^T$, with $\Lambda_k = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k)$ where $\lambda_i > \epsilon$ for $i \le k$
and $U_k = [u_1\ u_2\ \ldots\ u_k]$. The removal of small eigenvalues has a substantial impact on how well the
low-rank approximation regularizes the problem. The exclusion of eigenvalues less than $\epsilon$ assures
that there is no b such that $\|(A^{(k)})^{+} b\| > \|b\| / \epsilon$, where $(A^{(k)})^{+}$ denotes the
pseudoinverse. Replacement of the eigenspace spanned by columns of
$U_k$ with $\mathcal{K}_k(A, b)$ will result in some eigenvalues of $A^{(k)}$ possibly having magnitude smaller than $\epsilon$,
and the resulting violation of the guarantee $\|(A^{(k)})^{+} b\| \le \|b\| / \epsilon$.
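
The regularizing effect of the truncated spectral decomposition can be illustrated with a small numerical sketch. The test matrix, its spectrum, the threshold, and the function name `truncated_spectral_solve` are all hypothetical choices made for illustration; an exact dense eigendecomposition is used here, not a Krylov approximation.

```python
import numpy as np

def truncated_spectral_solve(A, b, eps):
    """Solve A x = b using only the eigenpairs of symmetric A whose eigenvalue exceeds eps."""
    lam, U = np.linalg.eigh(A)
    keep = lam > eps
    U_k, lam_k = U[:, keep], lam[keep]
    # x = U_k diag(1 / lam_k) U_k^T b, the pseudoinverse of the truncated matrix applied to b
    return U_k @ ((U_k.T @ b) / lam_k)

rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.standard_normal((6, 6)))            # random orthonormal eigenvectors
spectrum = np.array([10.0, 5.0, 2.0, 1e-6, 1e-8, 1e-10])    # hypothetical spectrum
A = V @ np.diag(spectrum) @ V.T
A = (A + A.T) / 2                                           # enforce exact symmetry
b = rng.standard_normal(6)
eps = 1e-3

x_reg = truncated_spectral_solve(A, b, eps)   # regularized solution
x_full = np.linalg.solve(A, b)                # unregularized solution, dominated by tiny eigenvalues

print(np.linalg.norm(x_reg), np.linalg.norm(x_full))
```

Because every retained eigenvalue exceeds the threshold, the norm of the regularized solution cannot exceed the norm of b divided by the threshold; a Krylov-based substitute loses this guarantee whenever one of its Ritz values falls below the threshold.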

We recall that minimal Krylov subspaces have a particular advantage owed to their minimally
short nature. No other Krylov subspace is less expensive to produce. Therefore, they represent a
compute-cost lower bound on eigenspace approximation with Krylov subspaces. Choice of an
acceleration method must be informed by the additional compute costs introduced. When non-convergence
of spectral components is a problem, it may be solved with a non-minimal Krylov subspace followed
by Ritz vector truncation. Memory constraints may not permit generation of a non-minimal Krylov
subspace; storage of more than k Lanczos vectors may simply not be possible. For example, latent
semantic analysis may require hundreds of dimensions; storing 100 dimensions of even a corpus with
$10^6$ documents or terms will require on the order of $8 \times 10^8$ bytes, roughly 762 MB, if double-precision
floating point numbers are used. This motivates investigation of methods to accelerate convergence of the
leading spectral components required for low-rank approximation and exclusion of the trailing
spectral components representing noise. Methods to target a neighborhood of the input matrix spectrum
have been previously studied, but present unwarranted expense for the minimal Krylov subspace
problem when leading spectral components are required. Use of shift-and-invert preconditioning to
accelerate convergence of leading eigenvalues imposes compute costs that are generally prohibitive;
the input matrix must be inverted. Even when the input matrix is sparse and matrix inversion is not
cubic in complexity, the structure of the input matrix may lead to non-trivial fill-in or inversion costs.
Therefore, we must choose an acceleration method that is computationally inexpensive, especially
relative to minimal Krylov subspaces.
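
The storage figure quoted above can be checked with a short calculation. This is a sketch assuming 8 bytes per double-precision entry and binary megabytes (2^20 bytes); the function name is hypothetical.

```python
def basis_storage_bytes(n_rows, n_vectors, bytes_per_entry=8):
    """Bytes needed to store a dense n_rows x n_vectors basis (8 bytes per double)."""
    return n_rows * n_vectors * bytes_per_entry

total_bytes = basis_storage_bytes(10**6, 100)   # 100 basis vectors of length 10^6
total_mb = total_bytes / 2**20                  # binary megabytes

print(total_bytes, total_mb)
```

Storage grows linearly with the number of basis vectors, so a non-minimal subspace with several times as many vectors needs several times this amount.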

Use of power iteration to expand local gaps and accelerate convergence in a projective eigenvalue
method was proposed in [36], but is also comparable to inner iteration used in flexible GMRES [69]
for preconditioning linear systems. In flexible GMRES, the preconditioner is allowed to change from
iteration to iteration, which allows an iterative solver to be used as a preconditioner instead of
a static approach such as an ILU or SSOR preconditioner. Any iterative method could be used to
approximate $A^{-1}$ in flexible GMRES, even GMRES itself. We note that the linear systems
preconditioning problem is different from the eigenvalue preconditioning problem; for linear systems, an
effective preconditioner is a matrix $M^{-1}$ that approximates $A^{-1}$ and such that the condition number
$\kappa(M^{-1}A)$ is small. Inner power iteration is similar to using p steps of conjugate gradient as an
inner iteration in GMRES, but the Krylov subspaces generated differ. Inner power iteration produces
$\mathcal{K}_i(A^p, x_0)$ whereas flexible GMRES produces $\operatorname{span}\{x_0, q_1(A)x_0, q_2(A)x_0\}$ for polynomials $q_i$ of order
$ip$. When the input matrix is symmetric, inner power iteration with p = 2 produces the same Krylov
subspace as the normal equations $A^T A x = A^T b$, since $A^T A = A^2$. Though power iteration
is an eigenvalue method in its own right, its slow convergence for single vectors renders it ineffective
for practical use. Nevertheless, combination of inner power iteration with Lanczos outer iteration
produces an eigenvalue approximation method that approximates leading spectral components more
successfully than ordinary Lanczos iteration, without the need for expensive and complicated
implicit restart methods. First, we review power iteration methods and Krylov subspace methods for
the eigenproblem. Analysis of the convergence of eigenvalues in Krylov subspaces indicates the
potential effects of inner power iteration; we then present the Lanczos algorithm modified with inner
power iteration.

7.2  Inner  iteration  with  Krylov  subspace  methods  for  the  eigenproblem 

Krylov subspaces are deeply related to power iteration; in fact, a Krylov subspace is the very space
spanned by the successive products of power iteration. Recall that given a matrix A and a suitable
vector $x_0$, a Krylov subspace is given by

(7.1)  $\mathcal{K}_i(A, x_0) = \operatorname{span}\{x_0, A x_0, A^2 x_0, \ldots, A^{i-1} x_0\}.$

For minimal Krylov subspaces, substitution of the input matrix A with $A^p$ will increase the
leading eigengaps. This substitution improves the quality of the leading eigenvalue approximations
in the subspace produced. The direct formation of $A^p$ may be expensive in terms of compute time,
and may further lead to loss of sparsity. Instead of explicit formation of $A^p$ as an input matrix to
the Lanczos routine, $A^p$ may be applied via an inner iteration. Such an approach is similar to other
inner iteration preconditioning methods for Krylov subspaces, which solve a linear system to effect
a shift-and-invert preconditioning of A. The resulting algorithm is presented in Algorithm 11.
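
The difference between forming the matrix power explicitly and applying it through an inner iteration can be illustrated as follows. This is a minimal sketch on a hypothetical tridiagonal test matrix; dense NumPy storage is used only to keep the example dependency-free, and the helper name `apply_power` is an assumption.

```python
import numpy as np

def apply_power(A, x, p):
    """Return A^p x using p matrix-vector products; A^p itself is never formed."""
    y = x
    for _ in range(p):
        y = A @ y
    return y

n, p = 200, 4
# Hypothetical tridiagonal test matrix with 3n - 2 nonzero entries.
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A_p = np.linalg.matrix_power(A, p)      # explicit power, formed for comparison only
x = np.ones(n)

nnz_A = np.count_nonzero(A)
nnz_A_p = np.count_nonzero(A_p)         # the explicit power fills in additional diagonals
same_result = np.allclose(apply_power(A, x, p), A_p @ x)

print(nnz_A, nnz_A_p, same_result)
```

The inner iteration touches only the nonzeros of A, p times per outer step, whereas the explicit power has to store and multiply by the filled-in matrix.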

Unfortunately, adding inner power iteration causes some difficulty for use of the Lanczos method
for eigenvalue approximation. Given the input matrix $A^p$ and the output matrices Q and T from the
Lanczos algorithm, we have

(7.2)  $Q^T A^p Q = T$

and a Ritz value $\theta_i = v_i^T A^p v_i$ for some Ritz vector $v_i$ approximates an eigenvalue of $A^p$ from below.
However, it is not the case that $\sqrt[p]{\theta_i} = v_i^T A v_i$ due to round-off error. Therefore, once augmented
with inner iteration, the Lanczos method cannot be used to approximate eigenvalues of A directly as
lower bounds. Instead, the input matrix must be projected into the Krylov subspace after Lanczos
iteration to obtain lower eigenvalue approximations. The expense of this second projection is not as
severe as it may seem if one compares the approach to Algorithm 3. Both approaches first generate an
orthonormal basis for a k-dimensional space using k sparse matrix-vector products and dense linear
algebra operations using no more than $O(nk^2)$ FLOPS and project the matrix onto the orthonormal
basis. The complexity of using the basis from k iterations of the Lanczos algorithm is no worse than
the random projection method. When selective reorthogonalization is used, the
Lanczos method will likely require less than $O(nk^2)$ FLOPS to generate its projection matrix, and
will be faster than the random projection method.

Algorithm 11 Lanczos with inner power iteration
Require: a priori chosen start vector $q_1$ with $\|q_1\| = 1$, power p
    $r \leftarrow q_1$
    for $i = 1 \to p$ do
        $r \leftarrow A r$
    end for
    for $j = 1 \to k$ do
        $\alpha_j \leftarrow q_j^T r$
        $r \leftarrow r - q_j \alpha_j$
        $\beta_{j+1} \leftarrow \|r\|$
        $q_{j+1} \leftarrow r / \beta_{j+1}$
        $r \leftarrow q_{j+1}$
        for $i = 1 \to p$ do
            $r \leftarrow A r$
        end for
        $r \leftarrow r - q_j \beta_{j+1}$
    end for
    return $Q = [q_1\ q_2\ \ldots\ q_k]$ and the symmetric tridiagonal matrix T with diagonal
        $\alpha_1, \ldots, \alpha_k$ and off-diagonals $\beta_2, \ldots, \beta_k$
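
Algorithm 11 and the second projection described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation: it adds full reorthogonalization for numerical safety (which Algorithm 11 does not include), assumes a symmetric matrix and no breakdown, and uses a hypothetical dense test matrix with three well-separated leading eigenvalues.

```python
import numpy as np

def lanczos_inner_power(A, q1, k, p):
    """Run k Lanczos steps on A^p, applying A p times per step (A^p is never formed).

    Assumes A is symmetric and that no breakdown occurs (every beta is positive).
    """
    n = q1.shape[0]
    Q = np.zeros((n, k))
    alpha = np.zeros(k)
    beta = np.zeros(k + 1)            # beta[j] couples columns j-1 and j; beta[0] stays 0
    q_prev = np.zeros(n)
    q = q1 / np.linalg.norm(q1)
    for j in range(k):
        Q[:, j] = q
        r = q.copy()
        for _ in range(p):            # inner power iteration: r = A^p q_j
            r = A @ r
        r -= beta[j] * q_prev
        alpha[j] = q @ r
        r -= alpha[j] * q
        # Full reorthogonalization: an addition for numerical safety, not part of Algorithm 11.
        r -= Q[:, : j + 1] @ (Q[:, : j + 1].T @ r)
        beta[j + 1] = np.linalg.norm(r)
        q_prev, q = q, r / beta[j + 1]
    T = np.diag(alpha) + np.diag(beta[1:k], 1) + np.diag(beta[1:k], -1)
    return Q, T

rng = np.random.default_rng(0)
n, k, p = 200, 5, 3
eigs = np.linspace(1.0, 0.01, n)
eigs[:3] = [10.0, 6.0, 3.0]           # hypothetical well-separated leading eigenvalues
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = V @ np.diag(eigs) @ V.T
A = (A + A.T) / 2                     # enforce exact symmetry

Q, T = lanczos_inner_power(A, rng.standard_normal(n), k, p)
ritz_of_A_p = np.linalg.eigvalsh(T)            # Ritz values of A^p, read off T
ritz_of_A = np.linalg.eigvalsh(Q.T @ A @ Q)    # second projection: Ritz values of A itself
```

The tridiagonal T only carries information about the powered matrix, so the extra projection `Q.T @ A @ Q` is what yields lower-bound approximations to the eigenvalues of A.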

7.3  Convergence of eigenvalues

Adjustment of the input matrix spectrum influences convergence of eigenvalues. Often, Krylov
subspace projections offer fast convergence of leading eigenvalues; for many low-rank approximation
problems, leading eigenvalues are well-separated and converge with satisfactory rapidity.
Nevertheless, inner power iteration can accelerate convergence of leading eigenvalues. The mechanism
by which inner iteration accelerates convergence is intuitive; it expands the local eigenvalue
separation that drives eigenvalue convergence in Krylov subspaces. An alternate intuition is that the
Krylov subspace $\mathcal{K}_i(A^p, x_0)$ approximates the larger Krylov subspace $\mathcal{K}_{ip}(A, x_0)$. Both ideas are
illuminating, but the latter of the two leads to a more elegant formal result. Indeed, there is a
relationship between the optimal minimizing polynomials for $\|q(A)x_0\|$ for $A^p$ and A; this relationship
between polynomials yields worst-case comparisons between eigenvalue estimates for $\mathcal{K}_i(A^p, x_0)$ and
$\mathcal{K}_i(A, x_0)$. We begin by generalizing the worst-case bounds from [67] to compare the optimal degree-i
polynomial to the non-optimal degree-ip polynomial. This difference leads directly to a restatement
of the Saad bounds [67].

The context of this re-derivation of bounds for Krylov subspace approximations of the form
$\mathcal{K}_i(A, x_0)$ lies in our original low-rank approximation task. This context motivates bounds that
allow us to compare the convergence of eigenvalues in $\mathcal{K}_{i/p}(A^p, x_0)$ to convergence of eigenvalues
in $\mathcal{K}_i(A, x_0)$ so that we may quantify the loss of accuracy when substituting a power
iteration-augmented minimal Krylov subspace for a non-minimal one.
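
The comparison rests on a nesting of subspaces that follows directly from the definition in (7.1). Written out as a worked restatement (not an additional result of the source):

```latex
\mathcal{K}_i(A^p, x_0)
  = \operatorname{span}\{x_0,\ A^{p} x_0,\ A^{2p} x_0,\ \ldots,\ A^{(i-1)p} x_0\}
  \subseteq \mathcal{K}_{(i-1)p+1}(A, x_0)
  \subseteq \mathcal{K}_{ip}(A, x_0).
```

For instance, with i = 3 and p = 4 the powered subspace is spanned by $x_0$, $A^4 x_0$ and $A^8 x_0$, all of which lie in $\mathcal{K}_{12}(A, x_0)$; the powered subspace stores 3 basis vectors where the longer subspace stores 12.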
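Recall Theorem 2 from [67], where the error in eigenvalue approximation is bounded as

(7.3)  $0 \le \lambda_i - \lambda_i^{(n)} \le (\lambda_i - \lambda_{\inf}) \left[ \frac{K_i^{(n)} \tan\theta(\varphi_i, x_0)}{T_{n-i}(\gamma_i)} \right]^2$

where

$K_i^{(n)} = \prod_{j=1}^{i-1} \frac{\lambda_j^{(n)} - \lambda_{\inf}}{\lambda_j^{(n)} - \lambda_i}$ for $i > 1$, and $K_1^{(n)} = 1$,

with $\lambda_1^{(n)}, \ldots, \lambda_{i-1}^{(n)}$ the first $i - 1$ approximate eigenvalues, $T_j(\cdot)$ the Chebyshev polynomial of
the first kind of order j, and

$\gamma_i = 1 + \frac{2(\lambda_i - \lambda_{i+1})}{\lambda_{i+1} - \lambda_{\inf}}.$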

The form of these bounds is a result of the relationship between the Krylov subspace $\mathcal{K}_i(A, x_0)$ and
the set of polynomials of degree no greater than $i - 1$. In $\mathcal{K}_i(A, x_0)$, one may construct a degree $i - 1$
polynomial $q(x)$ that is optimal for minimizing $\|q(A)x_0\|$. Furthermore, the roots of this polynomial
$q(x)$ are the approximate eigenvalues of the projection of A into $\mathcal{K}_i(A, x_0)$ [66, 67, 82]. A consequence
of this minimization property is that no other polynomial $f(x)$ of degree $i - 1$ results in a
smaller $\|f(A)x_0\|$ than $q(x)$. The Chebyshev polynomial in the denominator of (7.3) is a result of this
optimality. Thus, we may expect that changes to the structure of the Krylov subspace and resulting
optimal polynomial $q(x)$ will result in a change of the Chebyshev polynomial. The Krylov subspace
$\mathcal{K}_i(A, x_0)$ admits an optimal polynomial $q(x)$ of degree no greater than $i - 1$ for minimization of
$\|q(A)x_0\|$ and the Krylov subspace $\mathcal{K}_i(A^p, x_0)$ admits an optimal polynomial $r(x)$ of degree no greater
than $i - 1$ for minimization of $\|r(A^p)x_0\|$. If $\mathcal{P}^{(i-1)}$ is the set of all polynomials of degree no larger than
$i - 1$, then we have both $q(x) \in \mathcal{P}^{(i-1)}$ and $r(x) \in \mathcal{P}^{(i-1)}$. There is then an apparent relationship between
$q(x)$ and $r(x)$ that we exploit in the following theorem to show how the bounds for (7.3) change with
inner power iteration.


Theorem 11. Consider a Krylov subspace $\mathcal{K}_n(A, X_0)$ with block size r. Let $\lambda_i$ be an eigenvalue of
the matrix A with associated eigenvector $\varphi_i$ with $\|\varphi_i\| = 1$. Let p be a natural number. Assume that
the vectors $\pi_1(A)\varphi_j$ are linearly independent, where $\pi_1(A)$ is the orthogonal projection on $E_1$, the space
spanned by the initial start vectors of the Krylov subspace. Let $\theta_i$ be the approximation of $\lambda_i^p$ from the
projection of $A^p$ into $\mathcal{K}_n(A^p, x_0)$. Then the error of the approximate eigenvalue $\lambda_i^{(n)}$ as generated by
the projection of A into $\mathcal{K}_n(A^p, x_0)$ for approximating $\lambda_i$ is bounded as

(7.4)  $0 \le \lambda_i - \lambda_i^{(n)} \le (\lambda_i - \lambda_{\inf}) \left[ \frac{K_i^{(n,p)} \tan\theta(\varphi_i, x_0)}{T_{n-i}(\delta_i)} \right]^2$

where

$K_i^{(n,p)} = \prod_{j=1}^{i-1} \frac{\sqrt[p]{\theta_j} - \lambda_{\inf}}{\sqrt[p]{\theta_j} - \lambda_i}$ for $i > 1$, and $K_1^{(n,p)} = 1$,

with $\theta_1, \ldots, \theta_{i-1}$ the first $i - 1$ approximate eigenvalues of $A^p$, $T_j(\cdot)$ the Chebyshev polynomial
of the first kind of order j, and

$\delta_i = 1 + \frac{2(\lambda_i^p - \lambda_{i+1}^p)}{\lambda_{i+1}^p - \lambda_{\inf}^p}.$

Proof. We proceed in the same manner as Theorem 2 in [67]. By the Courant characterization of
eigenvalues  of  symmetric  operators,  we  have 


(7.5) 


A*."^  =  max 


u^APu 


u^F) 


w  u'^u 


Then 

(7.6) 


0  <  A;  —  a(”^  =  min 


uHi\^iI-AP)u 


ueF] 


in) 


T 
There is an optimal polynomial $q(x)$ of degree no greater than $n - 1$ that minimizes $\|q(A^p)x_0\|$.
Then $u = q(A^p)x_0 = \sum_j \alpha_j q(\lambda_j^p)\varphi_j$. Consider a transform function operating on polynomials
$f(x) = \sum_{j=0}^{n-1} a_j x^j$, with

(7.7)  $m(f)(x) = \sum_{j=0}^{n-1} a_j x^{pj}.$

Then $q(A^p) = m(q)(A)$. Therefore $m(q)(x)$ also minimizes $\|m(q)(A)x_0\|$ and $u = m(q)(A)x_0$. Note
that though $q(x) \in \mathcal{P}^{(n-1)}$ (the set of polynomials of degree no larger than $n - 1$), in general
$m(q)(x) \notin \mathcal{P}^{(n-1)}$. The mapping $m(f)(x)$ in fact defines a new set of
polynomials $m(\mathcal{P}^{(n-1)}) = \{m(q) \mid q \in \mathcal{P}^{(n-1)}\}$ for any set of polynomials $\mathcal{P}^{(n-1)}$, where all polynomials in
$m(\mathcal{P}^{(n-1)})$ have degree no greater than $p(n - 1)$ and have no nonzero coefficients for any power of x not
divisible by p. Let $r(x) = m(q)(x)$. Then


u^{Xi-A)u  1 


^{Xi-Xj)r{Xjfa)+  £  {Xi-Xj)r{Xjfa) 


‘  ®  u^u  ■ 

Since  r{x)  =  qix^)  =  0  r(x)  =  0.  From  Lemma  3  in  [67],  we  have  qiOj)  =  0  for 

1  <  j  <  i,  so  r(^/0j)  =  0  for  1  <  j  <  i,  and  r(x)  has  n  -  1  real  roots.  Then  r(x)  may  be  written  as 
r(x)  =  f{x)YYj~J^{x-  where  fix)  =  ^  Following  the  derivations  in  [67], 


we  have 


(7.10) 


u^{XiI-A)u  ^  ^  ^  iXj- Wi)^---iXj- i/(hIl)^fiXj)^a^j 

u^u  iXi  -  ...iXi-  ^/^i)^f{XM 


and  noting  that  ^/Oj  >  Ainf  for  any  j  leads  to 


(7.11) 


u^{XiI-A)u  ^  ^-i(^-Ai„f)2  ^  ajfiXj)'^ 

u^u  ^"7=1  (^-A,)2 


As  r(x)  =  n”_;J(x-  i/Oj)  is  the  polynomial  in  =  ^T.jJiCiiX^^\ai  e  Ir|  that  minimizes  (7.8),  it  is 

implied  that  fix)  minimizes  T.^-i.Aa^ifiXX^yia'^fiXi)'^).  Thus 

J  —  IX-  L  J  J  I 


(7.12) 
We may separate terms in the right-most side of (7.12) and obtain


(7.13) 
Noting  that  T^-iiS)  e  for  S  =  1  -i-2(x^  -  'lf+i)/(^f+i  ^  di4’i>^o),  we


(7.14) 
Combining (7.12) and (7.14) yields

(7.15)  $0 \le \lambda_i - \lambda_i^{(n)} \le (\lambda_i - \lambda_{\inf}) \left[ \frac{K_i^{(n,p)} \tan\theta(\varphi_i, x_0)}{T_{n-i}(\delta_i)} \right]^2$

which completes the proof. □


Note that the value in (7.4) differs from that obtained by simply substituting $A^p$ into the bounds
from [67], which would be

(7.16)  $0 \le \lambda_i^p - \theta_i \le (\lambda_i^p - \lambda_{\inf}^p) \left[ \frac{K_i^{(n)} \tan\theta(\varphi_i, x_0)}{T_{n-i}(\delta_i)} \right]^2$

with $K_i^{(n)}$ evaluated for the spectrum of $A^p$. A notable consequence of this theorem is that
eigenvalues of A converge in $\mathcal{K}_i(A^p, x_0)$ at least as
fast as eigenvalues of $A^p$. Since local gaps in the spectrum are expanded by the power p, we may
expect $\delta_i$ to be larger than $\gamma_i$, resulting in faster convergence, especially with successive iterations.
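
The effect of the enlarged gap parameter on the Chebyshev factor in the bounds can be illustrated numerically. This is a sketch on a hypothetical four-eigenvalue spectrum; the function names are assumptions, and the closed form $T_j(x) = \cosh(j \operatorname{arccosh} x)$ is used, which is valid for $x \ge 1$.

```python
import math

def chebyshev_T(j, x):
    """Chebyshev polynomial of the first kind T_j(x), evaluated for x >= 1."""
    return math.cosh(j * math.acosh(x))

def gap_parameter(eigs, i, p=1):
    """1 + 2 (lam_i^p - lam_{i+1}^p) / (lam_{i+1}^p - lam_inf^p).

    eigs must be positive and sorted in descending order; i is a 0-based index.
    """
    lam = [x ** p for x in eigs]
    return 1.0 + 2.0 * (lam[i] - lam[i + 1]) / (lam[i + 1] - lam[-1])

eigs = [3.0, 2.0, 1.0, 0.5]    # hypothetical spectrum
n = 3                          # subspace dimension

gamma_1 = gap_parameter(eigs, 0, p=1)   # gap parameter without inner iteration
delta_1 = gap_parameter(eigs, 0, p=4)   # gap parameter with inner power p = 4

# Damping factor 1 / T_{n-1}^2 applied to the leading eigenvalue error bound.
damping_plain = 1.0 / chebyshev_T(n - 1, gamma_1) ** 2
damping_power = 1.0 / chebyshev_T(n - 1, delta_1) ** 2

print(gamma_1, delta_1, damping_plain, damping_power)
```

Because the Chebyshev polynomial grows rapidly outside the interval [-1, 1], even a moderate increase in the gap parameter shrinks the damping factor considerably, and the effect compounds as the subspace dimension grows.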


7.4  Examples 

To show the effects of inner power iteration on the theoretical asymptotic convergence and a
posteriori convergence of eigenvalue approximations in Krylov subspaces, we present numerical results.
We use two matrices in the experiment: both exhibit the spectra typical of low-rank
approximation problems encountered in PCA-like dimension reduction problems. The first problem arises from
structural analysis and involves a stiffness matrix. The second experiment is on a much larger
matrix derived from an information retrieval problem.

7.4.1  nos5  stiffness  matrix 

The nos5 stiffness matrix from the Harwell-Boeing matrix collection [20] arises from finite-element
analysis of a building. Buildings may typically be treated as stiff structures with negligible damping;
therefore, natural frequencies may be calculated from the stiffness matrix alone. The most
important frequencies of interest are those eigenvalues of A with smallest magnitude, but if A is easily
inverted (for example, with a sparse Cholesky factorization) then $A^{-1}$ may be used, and trailing
eigenvalues of A become leading eigenvalues of $A^{-1}$.

Figure 7.1: Spectrum of the inverted nos5 stiffness matrix.

eigenvalue                              1                 2                 3
$\mathcal{K}_{12}(A, x_0)$      1.53 × 10^{-?}    7.1 × 10^{-?}     2.04 × 10^{-?}
$\mathcal{K}_3(A^4, x_0)$       1.28 × 10^{-?}    1.4 × 10^{-?}     2.52 × 10^{?}
$\mathcal{K}_3(A, x_0)$         1.14 × 10^{-?}    1.98              3.11 × 10^{?}

Table 7.1: Bounds for eigenvalue approximations using $\mathcal{K}_{12}(A, x_0)$, $\mathcal{K}_3(A, x_0)$, and $\mathcal{K}_3(A^4, x_0)$ with random $x_0$
for the inverted nos5 matrix.

To study the convergence of eigenvalues of the inverted nos5 matrix, we generated a random
start vector $x_0$ and examined the bounds predicted by Theorem 11 and actual eigenvalue errors for
the first 3 eigenvalues of the inverted A in $\mathcal{K}_{12}(A, x_0)$, the minimal $\mathcal{K}_3(A^4, x_0)$ and, for a baseline, the
minimal $\mathcal{K}_3(A, x_0)$. The worst-case bounds are presented in Table 7.1, and the errors are presented
in Table 7.2. It is important to note that the tightness of the bounds should be considered in terms
of relative error; that is, the worst-case error for the leading eigenvalue approximation in $\mathcal{K}_{12}(A, x_0)$
is $1.54 \times 10^{-?}$, which is $8.15 \times 10^{-?}$ percent error.

It is clear that the worst-case bounds are pessimistic in these cases, notably for $\lambda_3$. The addition
of inner power iteration does not obtain bounds as tight as the non-minimal Krylov subspace, but
still represents bounds two orders of magnitude tighter than the minimal Krylov subspace generated
without inner iteration. We note that $\mathcal{K}_3(A^4, x_0)$ obtains less than 1% relative error for $\lambda_1$ and $\lambda_2$.
Though $\mathcal{K}_{12}(A, x_0)$ has less than $10^{-?}$ percent error for any of $\lambda_1$, $\lambda_2$ or $\lambda_3$, this extra tightness is
only significant for $\lambda_3$, which $\mathcal{K}_3(A^4, x_0)$ approximates poorly. By way of comparison, $\mathcal{K}_3(A, x_0)$ has
at least 10% error for all eigenvalue approximations. Therefore, the relaxation to the minimal Krylov
subspace $\mathcal{K}_3(A^4, x_0)$ will not sacrifice much accuracy when it is computationally less expensive than
the non-minimal Krylov subspace.

eigenvalue                              1                   2                  3
$\mathcal{K}_{12}(A, x_0)$      -3.47 × 10^{-?}*    5.2 × 10^{-?}*     7.01 × 10^{-?}
$\mathcal{K}_3(A^4, x_0)$       1.59 × 10^{-?}      1.47 × 10^{-?}     7.75 × 10^{-?}
$\mathcal{K}_3(A, x_0)$         2.06 × 10^{-?}      5.78 × 10^{-?}     8.51 × 10^{-?}

Table 7.2: Eigenvalue approximation errors using $\mathcal{K}_{12}(A, x_0)$, $\mathcal{K}_3(A, x_0)$, and $\mathcal{K}_3(A^4, x_0)$ with random $x_0$ for
the inverted nos5 matrix. * indicates a value outside of machine precision.

Figure 7.2: Approximation error $\|A - A^{(k)}\|$ for the inverted nos5 stiffness matrix A (left). The approximation
error using the baseline minimal Krylov subspace $\mathcal{K}_k(A, x_0)$ is shown on the right. Horizontal axes: dimensions.

The impact of the errors may also be viewed in terms of the low-rank approximation error of $A$
in a Krylov subspace. Figure 7.2 shows the low-rank approximation errors of $A$ in the three respective
Krylov subspaces, with the approximation generated by the truncated SVD as a baseline comparison.
Though the non-minimal Krylov subspace is almost indistinguishable from the truncated SVD
approximation, the minimal Krylov subspace with inner power iteration becomes ever closer with
increasing dimension. This is significant, as the non-minimal Krylov subspace requires four times
as much memory.
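The comparison above rests on the containment $\mathcal{K}_3(A^4, x_0) \subseteq \mathcal{K}_{12}(A, x_0)$. The following is a minimal
NumPy sketch of that relationship on a small synthetic problem, not the code used for the experiments.
The helper names `krylov_basis` and `leading_ritz_value`, the test spectrum, and the use of explicit
Gram-Schmidt in place of the Lanczos recurrence are all assumptions made for illustration.

```python
import numpy as np

def krylov_basis(apply_op, x0, k):
    """Orthonormal basis of span{x0, B x0, ..., B^(k-1) x0} for an operator B.

    Gram-Schmidt is applied twice per step for clarity; this stands in for
    the Lanczos recurrence with reorthogonalization.
    """
    Q = np.zeros((x0.size, k))
    q = x0 / np.linalg.norm(x0)
    for j in range(k):
        Q[:, j] = q
        w = apply_op(q)
        for _ in range(2):
            w = w - Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)
        q = w / np.linalg.norm(w)
    return Q

def leading_ritz_value(A, Q):
    """Largest eigenvalue of the restriction Q^T A Q."""
    return np.linalg.eigvalsh(Q.T @ A @ Q)[-1]

rng = np.random.default_rng(0)
n, k, p = 200, 3, 4
eigs = 1.0 / np.arange(1, n + 1)             # hypothetical decaying spectrum
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (U * eigs) @ U.T
A = (A + A.T) / 2                            # symmetrize against rounding
x0 = rng.standard_normal(n)

def inner_power(v):
    """Apply A^p to v using p matrix-vector products (inner power iteration)."""
    for _ in range(p):
        v = A @ v
    return v

Q_long = krylov_basis(lambda v: A @ v, x0, k * p)   # non-minimal K_12(A, x0)
Q_inner = krylov_basis(inner_power, x0, k)          # minimal K_3(A^4, x0)
Q_short = krylov_basis(lambda v: A @ v, x0, k)      # baseline K_3(A, x0)

lam1 = np.linalg.eigvalsh(A)[-1]
err_long = lam1 - leading_ritz_value(A, Q_long)
err_inner = lam1 - leading_ritz_value(A, Q_inner)
err_short = lam1 - leading_ritz_value(A, Q_short)
print(err_long, err_inner, err_short)
```

Because the subspaces are nested, the leading Ritz value from the non-minimal subspace can never be worse
than the one from the inner-iteration subspace; the inner-iteration basis, however, stores only `k` vectors
instead of `k * p`.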

7.4.2  Enron  email  corpus  experiment 

To study the effects of inner power iteration on convergence of eigenvalues for an LSI application, we
use the Enron email corpus as collected from the UCI Machine Learning Repository [26]. The matrix
was approximated in the Krylov subspace $\mathcal{K}_i(AA^T, x_0)$ for a random $x_0$. The first 100 eigenvalues of
the matrix $AA^T$ are shown in Figure 7.3.

Figure 7.3: Spectrum of the centered covariance matrix of the Enron email data. (Horizontal axis: eigenvalue index.)

eigenvalue                     1                       2                       3                       4                       5
$\mathcal{K}_{20}(A, x_0)$     $1.16 \times 10^{-?}$   $5.25 \times 10^{-?}$   $1.81 \times 10^{-?}$   7.48                    $1 \times 10^{?}$
$\mathcal{K}_5(A^4, x_0)$      0.29                    3.0                     $2.93 \times 10^{?}$    $2.32 \times 10^{?}$    $2.96 \times 10^{?}$
$\mathcal{K}_5(A, x_0)$        12.36                   $2.54 \times 10^{?}$    $9.03 \times 10^{?}$    $1.75 \times 10^{?}$    $8.21 \times 10^{?}$

Table 7.3: Bounds for eigenvalue approximations using $\mathcal{K}_{20}(A, x_0)$, $\mathcal{K}_5(A, x_0)$, and $\mathcal{K}_5(A^4, x_0)$ with random $x_0$
for the Enron email data.

We begin by comparing the theoretical bounds for approximation with a non-minimal Krylov subspace
without inner iteration against those for approximation with a minimal one using inner iteration; that is,
we compare the rank-$k$ approximation from $\mathcal{K}_{kp}(A, x_0)$ to the approximation from $\mathcal{K}_k(A^p, x_0)$. Note
that the length of the non-minimal Krylov subspace grows quickly, even for small $p$. For comparison,
we set $k = 5$ and $p = 4$. The resulting bounds from Theorem 11 are given in Table 7.3. Clearly, the
worst-case approximations for $\mathcal{K}_{20}(A, x_0)$ are better than those for $\mathcal{K}_5(A^4, x_0)$. Much of this difference
is due to the smaller value obtained from the Chebyshev polynomial in the denominator of (7.4);
the order is much larger for a 20-dimensional subspace. Nevertheless, the expansion of local gaps
renders the worst-case bounds for the leading eigenvalue relatively small, and the bounds with
inner power iteration are better than the bounds for the minimal Krylov subspace $\mathcal{K}_5(A, x_0)$. The
approximation errors are shown in Table 7.4. In this case, the worst-case bounds are rather
pessimistic, but the inner power iteration produces better eigenvalue approximations than the baseline
Krylov subspace $\mathcal{K}_5(A, x_0)$.

To show the improvement in low-rank approximation using a non-minimal Krylov subspace and
a minimal Krylov subspace augmented with inner power iteration, we generate rank-$k$ approximations
using $\mathcal{K}_{kp}(A, x_0)$ and $\mathcal{K}_k(A^p, x_0)$ for $p = 4$ and $1 \le k \le 64$. The approximation error $\|A - A^{(k)}\|_F$
is shown in Figure 7.4. The plot includes the error associated with a low-rank approximation
generated from the truncated SVD; this serves as a baseline for the Frobenius-norm optimal rank-$k$
approximation. The low-rank approximation using truncated Ritz vector approximation is almost
equivalent to the truncated SVD; the inner iteration relaxation lags behind. Nevertheless, the inner
iteration Krylov subspace has better eigenvalue approximations than the baseline minimal Krylov
subspace when no inner iteration is used.

eigenvalue                     1                        2                        3                       4                       5
$\mathcal{K}_{20}(A, x_0)$     $1.78 \times 10^{-?}$*   $7.73 \times 10^{-?}$*   $1.67 \times 10^{-?}$   $2.51 \times 10^{-?}$   $1.39 \times 10^{-?}$
$\mathcal{K}_5(A^4, x_0)$      $3.54 \times 10^{-?}$    $1.66 \times 10^{-?}$    $3.46 \times 10^{-?}$   0.27                    1.95
$\mathcal{K}_5(A, x_0)$        $2.51 \times 10^{-?}$    0.32                     1.01                    1.75                    2.07

Table 7.4: Eigenvalue approximation errors using $\mathcal{K}_{20}(A, x_0)$, $\mathcal{K}_5(A, x_0)$, and $\mathcal{K}_5(A^4, x_0)$ with random $x_0$ for
the Enron email data. * indicates a value outside of machine precision.

Figure 7.4: Approximation error $\|A - A^{(k)}\|_F$ for $AA^T$ using the Enron email data (left). The approximation
error using the baseline minimal Krylov subspace $\mathcal{K}_k(A, x_0)$ is shown on the right.

To show the relative work required to generate the low-rank approximations, we present the
theoretical FLOP counts for the Lanczos algorithm to generate both $\mathcal{K}_{kp}(A, x_0)$ and $\mathcal{K}_k(A^p, x_0)$. These
costs are presented in Figure 7.5 and include partial reorthogonalization costs. We assume
that the number of converged Ritz vectors that must be excluded from each new basis vector is
asymptotically radical in the number of dimensions of the Krylov subspace. It is further assumed
that orthogonality is lost faster with inner power iteration; as a conservative estimate, a factor of 4
times more work is taken to be necessary to maintain orthogonality when inner iteration is used. The
input matrix is sparse, but has over 3 million nonzero elements, which is an order of magnitude
larger than the number of dimensions of the matrix. Therefore, matrix-vector products dominate in
the single-vector Lanczos method. Interestingly, the FLOP counts are roughly equivalent for small
dimensions, but the non-minimal Krylov subspace becomes more expensive for larger dimensions.
This is due to the cost of maintaining orthogonality, which is $O(f(x)^{?})$, where $f(x)$ is some function
of the dimension of the Krylov subspace. As $f(x)$ is assumed to be radical, the cost of maintaining
orthogonality is quadratic in the number of dimensions. As inner power iteration reduces the number
of iterations necessary, less work is required to exclude already-converged Ritz vectors from the rest
of the Krylov subspace. The cost tradeoff is in addition to the storage space tradeoff: storing a basis
for $\mathcal{K}_{64}(A^p, x_0)$ requires roughly 19.5 megabytes of storage for double precision numbers, but
$\mathcal{K}_{256}(A, x_0)$ requires nearly 78 megabytes.

Figure 7.5: Theoretical FLOP count for generating Krylov subspaces using the Lanczos algorithm. It is
assumed that inner power iteration will result in 4 times as much work to maintain orthogonality.
(Horizontal axis: dimensions, on a base-2 logarithmic scale.)
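The storage figures quoted above follow from simple arithmetic on the size of a dense basis. The sketch
below is illustrative only: the function name `basis_megabytes` is hypothetical, and the vector length
`n_rows = 39861` is an assumption (it is not stated in the text) chosen because, with megabytes counted
as $2^{20}$ bytes, it reproduces the quoted values.

```python
def basis_megabytes(n, k, bytes_per_entry=8):
    """Storage for a dense n-by-k basis of doubles, in units of 2**20 bytes."""
    return n * k * bytes_per_entry / 2**20

n_rows = 39861                       # assumed vector length, not given in the text
mb_inner = basis_megabytes(n_rows, 64)     # basis for K_64(A^p, x0)
mb_long = basis_megabytes(n_rows, 256)     # basis for K_256(A, x0), p = 4
print(round(mb_inner, 2), round(mb_long, 2))
```

The ratio between the two is exactly $p = 4$, independent of the assumed vector length.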

7.5  Conclusion 

In some cases, minimal Krylov subspace projections may be used to approximate a truncated
singular vector space or eigenspace. For many PCA-like applications, the leading eigenpairs needed for
the truncated projection converge fairly quickly; however, their convergence may be unsatisfactory
due to the minimal length of the Krylov subspace. Non-minimal Krylov subspaces may achieve far
better eigenvalue approximations, but the extra storage requirements and iteration costs negate the
compute and space advantages of minimal Krylov subspaces.

As local eigenvalue gaps drive convergence in Krylov subspaces, one may add inner power iteration
to the classic Lanczos algorithm to change the Krylov subspace from $\mathcal{K}_i(A, x_0)$ to $\mathcal{K}_i(A^p, x_0)$.
Thus, one may generate the minimal Krylov subspace $\mathcal{K}_i(A^p, x_0) \subseteq \mathcal{K}_{ip}(A, x_0)$ in place of the non-minimal
Krylov subspace $\mathcal{K}_{ip}(A, x_0)$. The effects of this change may be understood as expansion of local
gaps at the leading end of the spectrum, thereby resulting in faster convergence of leading
eigenvalues than in $\mathcal{K}_i(A, x_0)$. When the input matrix is sparse in proportion to its dimension, notable
compute and space advantages of minimal Krylov subspaces are maintained.

We have only considered simple inner power iteration generating polynomials of the form $q(x) = x^p$.
Application of more complicated polynomials of the form $q(x) = a_p x^p + a_{p-1} x^{p-1} + \dots + a_0$, where
not all $a_j$ are zero, may lead to even better results and allow more targeting of the spectrum of the
input matrix $A$. The theoretical arguments that show how inner power iteration accelerates the
convergence of leading eigenvalues and decelerates the convergence of trailing ones also imply that
use of a polynomial $q(x) = x^{1/p}$ will accelerate convergence of trailing eigenvalues at the expense
of leading ones. An inner Newton iteration could approximate the product $A^{1/p}x$ for problems that
require the smallest eigenvalues of $A$.


Chapter 8


Shift and invert preconditioning


Some low-rank approximation problems have spectra that cause difficulty for Krylov subspaces;
tightly clustered eigenvalues converge slowly. Inner power iterations can improve the quality of
minimal Krylov subspace projections for truncated SVD or spectral decomposition approximation;
this is due to accelerated convergence of leading eigenpairs. Inner power iteration is inexpensive and
appropriate for problems that require approximation of the leading part of the spectrum of the input matrix.
Many problems in low-rank approximation require the leading eigenpairs or singular triplets; however,
some important problems require the eigenpairs or singular triplets with smallest magnitude rather
than the largest magnitude. Inner power iteration is not appropriate for these cases, as it will
degrade the quality of the Krylov subspace by suppressing the very eigenpairs or singular triplets the
problem requires. One may apply a fractional matrix power $A^{1/p}$ to accelerate convergence of small
eigenpairs, but this would require Newton iteration to approximate the product $A^{1/p}x$ several times
over for each iteration. Moreover, the trailing eigenpairs are often so tightly clustered that inner
power iteration will not accelerate the convergence of trailing eigenpairs sufficiently.

Shift-and-invert preconditioning [24,51,72] is used in Krylov subspace methods to accelerate
convergence of eigenpairs that converge slowly due to tight clustering. Shift-and-invert preconditioning
leverages the fact that matrix polynomials in $A$ may be applied to the eigenvalues of $A$, so that for
an arbitrary Hermitian matrix $A$ with spectral decomposition $A = U \Lambda U^T$ and a polynomial $q(x)$,

(8.1)    $q(A) = q(U \Lambda U^T) = U q(\Lambda) U^T.$

This is the same property leveraged by inner power iteration, but instead of using the polynomial
$q(x) = x^p$ for some $p \in \mathbb{N}$, shift-and-invert preconditioning uses

(8.2)    $q(x) = (x - s)^{-1}$

for some fixed shift $s$. By comparison with inner iteration, which only requires extra sparse
matrix-vector products, shift-and-invert preconditioning requires an expensive matrix inversion, as the
name suggests. When the input matrix $A$ is large and sparse, direct matrix inversion may be
accomplished with a sparse LU factorization; this may be expensive in general, but is efficient in some
cases. For example, the sparse factorization is inexpensive for matrices arising from circuit
simulation and for matrices of some classes of graphs. When the sparse factorization is not expensive,
shift-and-invert preconditioning may be applied to drastically improve the convergence of trailing
eigenvalues of the input matrix $A$. Choosing shifts close to eigenvalues of interest accelerates
their convergence. Unfortunately, such a shift is difficult to choose beforehand in many low-rank
approximation problems. Choosing shifts adaptively can accelerate convergence; this is the basis
for algorithms such as Rayleigh quotient iteration and Jacobi-Davidson [4]. Neither algorithm is
suited to low-rank approximation without alteration. Rayleigh quotient iteration may exhibit
erratic convergence, and cannot be guaranteed to converge to eigenvalues close to the initial shift $s$.
Jacobi-Davidson converges more regularly, but we have observed that the subspaces it produces may
be inferior for low-rank approximation to those produced by Krylov subspace iteration on the shifted
and inverted matrix for trailing eigenvalue problems. In order to produce adaptive shifts for Krylov
subspaces, we propose a new algorithm that first uses the random projection method in Algorithm 3
to generate a set $S$ of trailing eigenvalue approximations for $A$, and then forms a matrix polynomial
of the form $\prod_{s \in S} (A - sI)^{-1}$ to simultaneously perform all shifts.
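The spectral mapping property behind (8.1) and (8.2) can be checked numerically. The sketch below is
illustrative only: the spectrum `lam`, the shift `s`, and the use of a dense inverse (rather than a sparse
factorization) are assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s = 6, 0.05
lam = np.array([5.0, 3.0, 1.0, 0.3, 0.2, 0.1])   # hypothetical spectrum of A
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (U * lam) @ U.T
A = (A + A.T) / 2

# q(A) with q(x) = 1 / (x - s), formed densely for illustration only.
B = np.linalg.inv(A - s * np.eye(n))
B = (B + B.T) / 2
mu, V = np.linalg.eigh(B)                         # ascending eigenvalues of q(A)

mapped = np.sort(1.0 / (lam - s))                 # q applied to the eigenvalues of A

# Relative gap at the trailing end of A versus the leading end of q(A).
gamma_before = (lam[-2] - lam[-1]) / (lam[0] - lam[-1])
gamma_after = (mu[-1] - mu[-2]) / (mu[-1] - mu[0])
print(gamma_before, gamma_after)
```

The eigenvectors are unchanged, the trailing eigenvalue of $A$ becomes the leading eigenvalue of $q(A)$, and
its gap relative to the width of the spectrum grows substantially.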

The role of the shift parameter $s$ in (8.2) is significant, and allows control over the parts of the
spectrum of $A$ that are accelerated. Convergence of eigenvalues in Krylov subspaces is governed by
local gaps $\lambda_j - \lambda_{j+1}$; applying a polynomial filter merely changes the gaps to $q(\lambda_j) - q(\lambda_{j+1})$.
Therefore, eigenvalues $\lambda_j$ close to $s$ will have $|\lambda_j - s| < \epsilon$ for some small $\epsilon$ and will have correspondingly
large $q(\lambda_j)$. Moreover, as the derivative of $q(x)$ is large for $x$ close to $s$, even those $\lambda_j$ that are tightly
clustered will become well separated with application of the polynomial filter. Choice of a good shift
parameter will result in much-improved convergence relative to a mediocre or poor choice of $s$. For
positive semi-definite problems requiring trailing eigenvalues, it is sufficient to simply set $s = 0$,
but without a priori information about the spectrum of the input matrix, a better choice is elusive.
Multiple shifts may be used to accelerate convergence to the desired part of the spectrum, and the
subspaces combined or recycled to avoid loss of any useful information already computed. The
rational Krylov method [65] uses multiple shifts to target multiple parts of the spectrum; however, it is
intended for non-Hermitian problems and for cases in which shifts are known beforehand. For Hermitian
problems, simplifications are possible that allow for accelerated convergence and reduced computational
costs. Additionally, these simplifications appear to allow for reduced sensitivity to the choice
of initial shift parameter $s$, resulting in more robust performance in general.


8.1  Background 


Krylov subspace projections are successful at quickly calculating eigenvalue approximations for
leading and trailing eigenvalues that are moderately separated from their neighbors. When eigenvalues
are tightly clustered, convergence requires a great many iterations; this slow convergence can
be critical due to practical constraints on the size of the Krylov subspace basis that can be stored in
memory. Even when restarts are used to extend the number of iterations, convergence of eigenvalues
between restarts may be slow enough to cause stagnation. For example, implicitly restarted Lanczos
iteration [11,46] using a Krylov subspace of dimension 100 failed to bring the trailing eigenvalue of
the Colorado road network Laplacian to convergence, even after 24 hours of computation.

Recalling the asymptotic bounds governing convergence of eigenvalues in Krylov subspaces, we
have that eigenvalue approximations have error bounded by

(8.3)    $0 \le \lambda_j - \lambda_j^{(i)} \le (\lambda_j - \lambda_{\inf}) \left( \frac{\tan\theta(x_0, u_j)}{T_i\left(\frac{1+\gamma_j}{1-\gamma_j}\right)} \right)^{2}$

where

(8.4)    $\gamma_j = \frac{\lambda_j - \lambda_{r+1}}{\lambda_j - \lambda_{\inf}}.$

For the Colorado road network Laplacian, the trailing eigenvalues have small values for $\gamma$ in (8.3).
The resulting values for $T_i((1+\gamma)/(1-\gamma))$ are small even for large $i$. For example, we have
$\lambda_n = 0.0489$, $\lambda_{n-1} = 0.132$ and $\lambda_1 = 321{,}018.5$. The resulting value of $\gamma = 2.59 \times 10^{-7}$ gives
$T_{100}((1+\gamma)/(1-\gamma)) = 1.0052$. Even expanding the block size causes minimal improvement in eigenvalue
convergence: for a block size of 50, the value of $T_{100}((1+\gamma)/(1-\gamma))$ is 1.297.
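The Chebyshev factor quoted for the unpreconditioned Laplacian can be reproduced with a few lines of
arithmetic. This is a sketch under stated assumptions: $\gamma$ is taken as the trailing gap divided by the
width of the spectrum (the single-vector case), which reproduces the quoted $\gamma$, and the helper name `cheb`
is hypothetical.

```python
import numpy as np

def cheb(i, x):
    """Chebyshev polynomial of the first kind T_i(x), valid for x >= 1."""
    return np.cosh(i * np.arccosh(x))

# Eigenvalues quoted in the text for the Colorado road network Laplacian.
lam_n, lam_n1, lam_1 = 0.0489, 0.132, 321018.5

gamma = (lam_n1 - lam_n) / (lam_1 - lam_n)        # assumed form of the relative gap
growth = cheb(100, (1 + gamma) / (1 - gamma))
print(gamma, growth)
```

A factor this close to 1 after 100 iterations means the bound is essentially no tighter than it was at the
start, which is consistent with the stagnation described above.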

To allow Krylov subspaces to calculate eigenvalues that are tightly clustered, one may apply
transformations to the spectrum that leave the eigenvectors unchanged. For example, applying a
diagonal shift

(8.5)    $A' = A - sI$

to generate a new matrix $A'$ leaves eigenvectors unchanged. Likewise, matrix inversion preserves the
eigenspace, since $A^{-1} = U \Lambda^{-1} U^T$ when $A = U \Lambda U^T$. In general, any matrix polynomial in $A$
does not alter the eigenspace of $A$, so to calculate eigenvalues of $A$ that are prone to slow convergence
one may apply a shift and an inversion. Applying shifts is not expensive, but matrix inversion may
be costly when the input matrix is large and sparse. When a sparse factorization is tractable, simply
inverting the matrix will result in satisfactory convergence of eigenvalues. Following the Colorado
road network example with $A' = A^{-1}$, the reciprocated eigenvalues are $\lambda'_1 = \lambda_n^{-1} = 20.454$,
$\lambda'_2 = \lambda_{n-1}^{-1} = 7.576$, and $\lambda'_n = \lambda_1^{-1} = 3.115 \times 10^{-6}$, which gives $\gamma = 0.63$ and
$T_{10}((1+\gamma)/(1-\gamma)) = 1.22 \times 10^{9}$. Though this convergence is strong, selection of a shift close to
$\lambda_n$ will lead to even faster convergence. With $s = 0.04$, $\lambda'_1 = (\lambda_n - 0.04)^{-1} = 112.51$,
$\lambda'_2 = (\lambda_{n-1} - 0.04)^{-1} = 10.87$, and $\lambda'_n = (\lambda_1 - s)^{-1} = 3.115 \times 10^{-6}$, which gives $\gamma = 0.9$ and
$T_{10}((1+\gamma)/(1-\gamma)) = 4.48 \times 10^{15}$. We note that though the resulting eigenvalue
approximations are for $(A - sI)^{-1}$, eigenvalue approximations for $A$ may be recovered by application
of the function

(8.6)    $f(x) = x^{-1} + s.$
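The effect of inversion and of the shift $s = 0.04$ on the Chebyshev factor can be recomputed from the three
quoted eigenvalues. The following sketch is not the author's code; the names `cheb` and `recover` are
hypothetical, $\gamma$ is assumed to be the leading gap of the transformed spectrum divided by its width, and
small differences from the quoted values are expected because the inputs are rounded.

```python
import numpy as np

def cheb(i, x):
    """Chebyshev polynomial of the first kind T_i(x), valid for x >= 1."""
    return np.cosh(i * np.arccosh(x))

def recover(mu, s):
    """Map an eigenvalue mu of (A - sI)^{-1} back to an eigenvalue of A."""
    return 1.0 / mu + s

# lambda_n, lambda_{n-1}, lambda_1 as quoted for the Colorado road network Laplacian.
lam = np.array([0.0489, 0.132, 321018.5])

results = {}
for s in (0.0, 0.04):
    mu = 1.0 / (lam - s)                      # transformed eigenvalues, largest first
    g = (mu[0] - mu[1]) / (mu[0] - mu[2])     # assumed form of the relative gap
    results[s] = (g, cheb(10, (1 + g) / (1 - g)))
    print(s, mu, results[s], recover(mu, s))
```

Only ten iterations are assumed here, yet the Chebyshev factor is many orders of magnitude larger than the
value obtained after 100 iterations on the untransformed spectrum.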

In some important cases, sparse LU or Cholesky factorization is not expensive. For example,
sparse factorizations are tractable for inversion of non-Hermitian matrices arising from circuit
simulation; these inversions are used to approximate transfer functions defined in terms of a large,
sparse matrix pencil. Some classes of graphs share the sparsity and planarity properties of
circuit graphs that allow for efficient sparse factorizations; road network graphs are one such class.

The mention of circuit simulation graphs is not simply to demonstrate a class of graphs for which
shift-and-invert methods are feasible due to relatively inexpensive sparse factorizations, but also to
motivate methods that improve convergence through good selection of shifts. This notion is
not limited to the non-Hermitian eigenproblem; numerous existing methods use adaptive shifts and
inversion to accelerate convergence towards chosen eigenvalues, such as Rayleigh quotient iteration
or the Jacobi-Davidson method [4]. In the circuit simulation domain, shifts are easy to know a priori
as they are given by outside constraints on the problem; the shifts are complex frequencies around
which the circuit simulation must be accurate. For a low-rank approximation problem in which
low-order eigenpairs are needed, it is only known that the eigenvalues are close to zero. For these cases,
judicious shifts are not easy to know beforehand.

Without good beforehand knowledge of the neighborhoods containing the desired eigenpairs of
$A$, good shifts for the simple shift-and-invert method are elusive. One may adapt the shift parameter
such that it exploits available information to choose shifts close to where eigenvalues are expected to
be. A rational polynomial of the form

(8.7)    $q(x) = \prod_{s_i \in S} \frac{1}{x - s_i}$

for a set of shifts $S$ can produce faster convergence within a neighborhood containing the shifts $S$.
Likewise, an adaptive algorithm could use the already-computed eigenvalue approximations of $A$ to
choose shifts that are close to eigenvalues that have not converged satisfactorily.

Adaptive or multiple shifts are not without their drawbacks. Adaptive shifts require explicit
orthogonalization to generate an orthonormal basis for the resulting Krylov subspace union
$\bigcup_{s_i \in S} \mathcal{K}_j((A - s_i I)^{-1}, x_0)$ for a set of shifts $S$. This increases the computational complexity of each
Lanczos iteration from linear, $O(nk)$, to quadratic in the number of dimensions, $O(nk^2)$. A further problem
is that the restriction $A^{(k)}$ of $A$ to $\bigcup_{s_i \in S} \mathcal{K}_j((A - s_i I)^{-1}, x_0)$ is not explicitly computed. Likewise, with
rational polynomials other than the simple shift $q(x) = (x - s)^{-1}$, the projected matrix that the iteration
produces on $\mathcal{K}_i(q(A), x_0)$ cannot be used directly to approximate eigenvalues of $A$. This
requires a re-projection of $A$ into $\mathcal{K}_i(q(A), x_0)$, at a cost of $O(nk^2)$ FLOPs.

Adaptive shifts are also not novel. For example, Rayleigh quotient iteration [4] may be viewed as
inverse power iteration with an adaptive shift at each step. The resulting convergence is improved
from the linear convergence of inverse iteration to the smallest eigenpair to cubic convergence. The
Jacobi-Davidson method [4,77] also uses adaptive shifts at each step to likewise accelerate convergence
and cope with inexact matrix inversion. Polynomial filtering methods are also not novel in general,
having been used for implicitly approximating the truncated SVD [42].
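The multi-shift filter (8.7) and the re-projection step discussed in this section can be sketched on a
small dense problem. Everything specific here is an assumption for illustration: the spectrum, the shift
set `shifts`, the helper name `apply_filter`, and the use of dense solves in place of sparse factorizations.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 8
lam = np.arange(1.0, n + 1)                  # hypothetical spectrum 1, 2, ..., 30
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (U * lam) @ U.T
A = (A + A.T) / 2
shifts = [0.5, 1.5]                          # hypothetical shift set S

def apply_filter(v):
    """Apply q(A) v = prod_{s in S} (A - sI)^{-1} v by successive solves."""
    for s in shifts:
        v = np.linalg.solve(A - s * np.eye(n), v)
    return v

# Orthonormal basis of K_k(q(A), x0), built with explicit orthogonalization.
Q = np.zeros((n, k))
q = rng.standard_normal(n)
q /= np.linalg.norm(q)
for j in range(k):
    Q[:, j] = q
    w = apply_filter(q)
    for _ in range(2):                       # Gram-Schmidt applied twice
        w = w - Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)
    q = w / np.linalg.norm(w)

# Re-projection of A itself into the filtered subspace (the O(n k^2) step).
ritz = np.linalg.eigvalsh(Q.T @ A @ Q)
print(ritz[:2])
```

With shifts on either side of the smallest eigenvalue, $q(A)$ is indefinite, yet the re-projected matrix
$Q^T A Q$ still yields approximations to the trailing eigenvalues of $A$ directly.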

8.2  Adaptive  shift  methods 

Adaptive shift methods for eigenvalue problems may readily be adapted to low-rank approximation
problems as an accelerated alternative to a Krylov subspace from a shifted and inverted matrix. The
simplest adaptive shift method is Rayleigh quotient iteration, presented in numerous texts; we refer
to the description in [4]. Rayleigh quotient iteration is similar to inverse iteration (power iteration
on $(A - sI)^{-1}$) but uses an adaptively-determined shift $s$ rather than a static one. It derives its name
from the method used to choose the shift at each iteration; it uses the Rayleigh quotient $v^T A v / v^T v$
to determine the next shift.

Rayleigh quotient iteration implicitly defines a subspace in terms of its vectors $q_j$; it is sufficient
to simply apply an orthonormalization process such as Gram-Schmidt to produce an orthogonal set
of vectors. Neglecting the cost of the matrix inversion, the overriding complexity is from the
orthonormalization process, and is $O(nk^2)$ for $A \in \mathbb{R}^{n \times n}$. The resulting algorithm is in Algorithm 12.

A drawback of Rayleigh quotient iteration is that it has unpredictable behaviors [4]; it may not
converge to the eigenvalue closest to the initial shift $s$. The initial start vector $q_1$ is also influential
in convergence, and should be close to $u_i$, where $(\lambda_i, u_i)$ is the eigenpair towards which convergence
is desired. For example, simply selecting $s = 0$ for a positive definite input matrix $A$ may not produce
the smallest eigenvalue $\lambda_{\inf}$. This unpredictable behavior may be corrected by use of the subspace
$Q$ in Algorithm 12 to generate a subspace approximation $A^{(k)}$ of $A$ in $\mathrm{span}\{Q\}$. The shift $s$ may then
be chosen to be $\lambda_{\inf}(A^{(k)})$. This essentially results in a variant of the Jacobi-Davidson algorithm [4]
with an exact solution. However, the method omits restarts, as it is intended to produce a minimal
subspace in the same spirit as minimal Krylov subspaces.

Algorithm 12  Rayleigh quotient iteration and orthonormal subspace generation
Require: a priori chosen start vector $q_1$ and initial shift $s$.
 1:  $Q \leftarrow q_1 / \|q_1\|$
 2:  for $j = 1, 2, \dots$ do
 3:      $y \leftarrow (A - sI)^{-1} q_j$
 4:      if $(A - sI)$ is singular then
 5:          return $q_{j+1}$, $\theta$
 6:      end if
 7:      $\sigma \leftarrow \|y\|_2$

 9:      $q_{j+1} \leftarrow y / \sigma$
10:      $Q \leftarrow [\, Q \;\; (q_{j+1} - QQ^T q_{j+1}) / \|q_{j+1} - QQ^T q_{j+1}\| \,]$
11:  end for
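One possible realization of the idea behind Algorithm 12 is sketched below in NumPy. It is not a
transcription of the listing: the function name `rqi_subspace`, the placement of the Rayleigh quotient
shift update, the dense solve, the second Gram-Schmidt pass, and the tolerance `tol` for discarding
directions already in the span of `Q` are all assumptions made so that the example runs as written.

```python
import numpy as np

def rqi_subspace(A, q1, s, steps=10, tol=1e-10):
    """Rayleigh quotient iteration that also collects an orthonormal basis Q.

    Returns (Q, s): Q has orthonormal columns spanning the iterates that
    contributed a new direction, and s is the final Rayleigh quotient shift.
    """
    n = A.shape[0]
    q = q1 / np.linalg.norm(q1)
    Q = q.reshape(n, 1)
    for _ in range(steps):
        try:
            y = np.linalg.solve(A - s * np.eye(n), q)
        except np.linalg.LinAlgError:
            break                       # A - sI is singular: s is an eigenvalue
        if not np.all(np.isfinite(y)):
            break
        q = y / np.linalg.norm(y)
        s = q @ A @ q                   # Rayleigh quotient gives the next shift
        r = q.copy()
        for _ in range(2):              # Gram-Schmidt against Q, applied twice
            r = r - Q @ (Q.T @ r)
        if np.linalg.norm(r) > tol:     # skip directions already in span(Q)
            Q = np.hstack([Q, (r / np.linalg.norm(r)).reshape(n, 1)])
    return Q, s

rng = np.random.default_rng(2)
M = rng.standard_normal((8, 8))
A = M @ M.T + np.eye(8)                 # hypothetical positive definite test matrix
Q, s = rqi_subspace(A, rng.standard_normal(8), s=0.0, steps=30)
print(Q.shape[1], s)
```

The final shift is an eigenvalue of the test matrix, but, as noted above, nothing guarantees that it is the
eigenvalue closest to the initial shift.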


These methods exhibit fast convergence, but for generating a subspace they may not be ideal
choices due to the need for a matrix inversion at each iteration. Though we have assumed that the
sparse factorization of $A - sI$ is tractable, it may still be a non-negligible expense. It is notable that
Jacobi-Davidson may tolerate an inexact solution of $(A - sI)x = q_j$, but this still requires an iterative
solver, which may be more expensive than the sparse factorization of $A - sI$ in the cases under
consideration here. Computational savings may be realized by reducing the number of factorizations
needed per subspace dimension generated. This motivates consideration of Krylov subspaces using
multiple shifts when sparse matrix factorizations are not expensive.

8.3  Polynomial  filter  methods 

Eigenvalue convergence may be accelerated by using multiple shifts in ways other than the multiple
shifts in Rayleigh quotient iteration, producing better subspaces at reduced cost. One such method is
to use polynomials other than the simple shift-and-invert polynomial (8.2). Slightly more complicated
polynomials may not require much more computation, but could lead to faster convergence of the
desired eigenvalues than a simple fixed shift, and have less sensitivity to the choice of shift $s$ than
methods using (8.2).

Shift-and-invert methods attain better convergence through expansion of local gaps between
eigenvalues. The role of local gaps is expressed in the denominator of (8.3). The effectiveness of a
preconditioner is influenced by how much it is able to expand gaps between desired $\lambda_i$ and $\lambda_{i+1}$
when they are tightly clustered, and how small it renders the infimum of eigenvalues. For a simple
shift we present the following proposition.

Proposition 8. Let $\lambda_1 > \lambda_2 > \dots > \lambda_{\inf}$ be real eigenvalues, let $s$ be an arbitrary real number, and let
$g = \lambda_i - \lambda_{i+1}$ be the eigenvalue gap between $\lambda_i$ and $\lambda_{i+1}$ for some natural $i$. Let the distance between
the shift $s$ and $\lambda_i$ be written as $\delta = \lambda_i - s$. Let $q_1(x) = (x - s)^{-1}$ be a spectral transform polynomial.
Then the gap for the transformed eigenvalues is given by

(8.8)    $q_1(\lambda_i) - q_1(\lambda_{i+1}) = \frac{1}{\lambda_i - s} - \frac{1}{\lambda_i - g - s} = \frac{1}{\delta} - \frac{1}{\delta - g}.$

The magnitude of (8.8) grows without bound as $\delta \to 0$, which implies that gaps are expanded in proportion
to the closeness of the shift to $\lambda_i$. This means that a good choice of $s$ is important for obtaining good
results from a shift-and-invert method.

To alleviate the importance of the shift $s$ on convergence of eigenvalues, we may use multiple
shifts, and form a matrix polynomial of the form $q(x) = \prod_{s_i \in S} (x - s_i)^{-1}$ for a set of shifts $S$. This
approach allows multiple guesses for the neighborhoods containing desired eigenvalues, and the
multiple shifts reinforce each other. For example, consider the case $|S| = 2$ and the following
proposition.


Proposition 9. Let $\lambda_1 > \lambda_2 > \dots > \lambda_{\inf}$ be real eigenvalues, let $s_1$ and $s_2$ be arbitrary real numbers, and let
$g = \lambda_i - \lambda_{i+1}$ be the eigenvalue gap between $\lambda_i$ and $\lambda_{i+1}$ for some natural $i$. Let
$q_2(x) = ((x - s_1)(x - s_2))^{-1}$ be a spectral transform polynomial. Assume without loss of generality that
$|\lambda_i - s_1| < |\lambda_i - s_2|$. Let $\delta = \lambda_i - s_1$ be the distance between the shift $s_1$ and $\lambda_i$, and let $s_2 = s_1 + c$.
Then the transformed gap is given by

(8.9)    $q_2(\lambda_i) - q_2(\lambda_{i+1}) = \frac{1}{(\lambda_i - s_1)(\lambda_i - s_2)} - \frac{1}{(\lambda_{i+1} - s_1)(\lambda_{i+1} - s_2)}$
         $= \frac{1}{(\lambda_i - s_1)(\lambda_i - s_1 - c)} - \frac{1}{(\lambda_i - g - s_1)(\lambda_i - g - s_1 - c)}$
         $= \frac{1}{\delta^2 - \delta c} - \frac{1}{(\delta - g)(\delta - g - c)}.$

This equation also has a limit of $\pm\infty$ as $\delta \to 0$, provided that $g \neq 0$. Moreover, the gap expansion
is larger when the proportion of the gap $g$ to the distance of the shifts is small, as the denominator
in the leading term of (8.9) is quadratic in $\delta$ rather than linear. For example, if $\lambda_{i+1} = 5.868 \times 10^{-6}$ and
$\lambda_i = 1.621 \times 10^{-5}$, then $g = 1.034 \times 10^{-5}$. If $s = s_1 = 1 \times 10^{-7}$ and $s_2 = 1 \times 10^{-5}$, then the transformed
gap from (8.8) has magnitude $1.112 \times 10^{5}$, whereas the transformed gap from (8.9) has magnitude
$5.196 \times 10^{10}$. These are actual eigenvalues of the
normalized Colorado road network graph Laplacian, presented in the following results section. By
having more shifts, the chance of one being a good guess is improved. We note that a polynomial of
the form in (8.7) may transform a positive semi-definite input problem into an indefinite problem.
This presents a difficulty, as the width of the interval $[a, b]$ containing all eigenvalues is increased.
This implies that the value of $\gamma$ from (8.4) may be adversely affected by the larger value of $\lambda_i - \lambda_{\inf}$.
Replacement of $q(x) = \prod_{s_i \in S} (x - s_i)^{-1}$ with $q(x)^2$ maintains positive semidefiniteness of the input
problem and avoids the growth of $\lambda_i - \lambda_{\inf}$ relative to $\lambda_i$.
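The single-shift and two-shift gap expansions in this example can be recomputed directly. The sketch below
is illustrative: the exponents of the eigenvalues and shifts are assumptions chosen because they reproduce
the quoted mantissas, the function names `q1` and `q2` are hypothetical, and the closed form uses
$\lambda_{i+1} = \lambda_i - g$, which follows from the definition of $g$.

```python
def q1(x, s):
    """Single-shift filter (x - s)^(-1)."""
    return 1.0 / (x - s)

def q2(x, s1, s2):
    """Two-shift filter ((x - s1)(x - s2))^(-1)."""
    return 1.0 / ((x - s1) * (x - s2))

lam_i, lam_next = 1.621e-5, 5.868e-6      # lambda_i and lambda_{i+1} (assumed exponents)
s1, s2 = 1e-7, 1e-5                       # assumed shift values

gap = lam_i - lam_next                    # untransformed gap g
gap1 = q1(lam_i, s1) - q1(lam_next, s1)
gap2 = q2(lam_i, s1, s2) - q2(lam_next, s1, s2)

# Closed form of the single-shift gap in terms of delta and g.
delta = lam_i - s1
closed1 = 1.0 / delta - 1.0 / (delta - gap)
print(gap, abs(gap1), abs(gap2))
```

In this example the second shift falls between the two eigenvalues, so the transformed values straddle
zero and the gap magnitude grows by roughly five further orders of magnitude over the single shift.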

Naturally, blind guesses at the locations of eigenvalues may lead to rather poor choices. Even when the input matrix is positive semi-definite and small eigenvalues are known to be greater than 0, locating those that are responsible for most of the norm of $A^{+}$ is still difficult. Multiple shifts do alleviate the need to precisely locate the neighborhood of the desired small eigenvalues, but not so much that a priori guesses reliably lead to better convergence than simply using $q(x) = x^{-1}$. Rather than simple blind guessing, one may use a preliminary set of iterations to gather information regarding the neighborhoods of desired eigenvalues of $A$. With information regarding the neighborhoods of eigenvalues of $A$, one may either perform a thick restart [84]
Figure 8.1: Leading 10 eigenvalues and eigenvalue approximations using JFio(A^,xo) and random projections.
Random  projection  approximations  have  more  uniform  tightness  compared  to  single-vector  Krylov  subspace 
approximations. 

and  save  whatever  useful  spectral  information  may  be  gleaned  from  the  preliminary  subspace,  or 
simply  discard  it  and  perform  new  iterations.  With  better  information  regarding  the  neighborhoods 
containing  the  desired  eigenvalues  of  A,  multiple  shifts  can  outperform  a  single  shift  more  decisively. 

We recall that the random projection method in Algorithm 3, investigated in previous chapters, produces good eigenvalue approximations. One characteristic we have observed is that the random projection method using a block size of $k$ produces eigenvalue approximations that have roughly uniform tightness for all $k$ leading eigenvalues. Figure 8.1 illustrates this difference in bounds. Note that the random projection method produces better approximations of eigenvalues 8 through 10. This behavior suggests that random projections or block Krylov subspaces with large blocks be used to generate the preliminary subspace for determining the shifts to use. With this information, we present the algorithm for multiple shift polynomial preconditioning for Krylov subspaces.

A drawback of using polynomials of the form (8.7) is that it is no longer possible to recover the original eigenvalues of the input matrix $A$ from those of the transformed matrix $q(A)$; this necessitates the reprojection of $A$ into the solution subspace on line 3. A function of the form $q(x) = (x - s)^{-1}$ is bijective, so there is always a unique solution $y$ for $q(y) = x$ for any $x$. Multiple shifts result in polynomials that are not necessarily bijective, so there may be multiple solutions $y$ of $q(y) = x$ for a given $x$. Therefore the shifts must be used to generate a projection basis $Q$ with $\operatorname{span}\{Q\} = \mathcal{K}_k(q(A), x_0)$, and
Algorithm 13 Preconditioned Krylov subspaces with multiple shift polynomials
Require: a priori chosen start block $X_0$ with $m$ columns for using $m$ shifts.

1:  compute  17, 'P  using  random  projections  (Algorithm  3)  on  A~^  with  Xq. 

2:  compute  Q,T  using  Lanczos  iteration  (Algorithm  1)  on  with  T-Jh  Vi/k. 

3:  $\hat{A} \leftarrow Q^{T} A Q$.

4:  spectral  decompose  V,&  =  A 
5:  return  QV,Q 

that basis is then used to project the original input matrix $A$ to $Q^T A Q$ for eigenvalue approximation. This requires a dense matrix-matrix product with complexity $O(nk^2)$. We note that this is the same complexity as the complete Gram-Schmidt process necessary for generation of an orthonormal basis in Rayleigh quotient iteration or the Jacobi-Davidson method.
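The core of the multi-shift procedure (build a Krylov basis for $q(A)$, then reproject the original $A$) can be sketched on a small dense problem. This is a minimal illustration under several assumptions, not the thesis implementation: explicit dense inverses stand in for the sparse LU solves, the shifts are chosen by hand rather than from a preliminary random projection, a single start vector is used, and full reorthogonalization replaces the Lanczos recurrence. The function name `multishift_krylov` and the test matrix are hypothetical.

```python
import numpy as np


def multishift_krylov(A, shifts, x0, k):
    """Ritz pairs of A from the Krylov subspace of q(A) = prod_i (A - s_i I)^{-1}."""
    n = A.shape[0]
    identity = np.eye(n)
    # Dense stand-in for one sparse factorization per shift.
    shifted_inverses = [np.linalg.inv(A - s * identity) for s in shifts]

    def apply_q(v):
        # One application of q(A) costs one solve per shift.
        for inv in shifted_inverses:
            v = inv @ v
        return v

    Q = np.zeros((n, k))
    Q[:, 0] = x0 / np.linalg.norm(x0)
    for j in range(1, k):
        w = apply_q(Q[:, j - 1])
        # Full Gram-Schmidt reorthogonalization, applied twice for stability.
        for _ in range(2):
            w = w - Q[:, :j] @ (Q[:, :j].T @ w)
        Q[:, j] = w / np.linalg.norm(w)

    # Eigenvalues of q(A) cannot be mapped back uniquely, so A is reprojected.
    A_hat = Q.T @ A @ Q
    theta, V = np.linalg.eigh(A_hat)
    return Q @ V, theta


# Hypothetical symmetric positive definite test matrix with known spectrum.
rng = np.random.default_rng(0)
n = 200
eigenvalues = np.linspace(1.0, 100.0, n)
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (U * eigenvalues) @ U.T

ritz_vectors, ritz_values = multishift_krylov(
    A, shifts=[0.9, 1.4, 1.9], x0=rng.standard_normal(n), k=30
)
print("smallest Ritz value:", ritz_values[0], "exact:", eigenvalues[0])
```

The final `Q.T @ A @ Q` product is the $O(nk^2)$ reprojection discussed above; it is needed because the Ritz values of $q(A)$ do not identify eigenvalues of $A$ uniquely when several shifts are used.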

8.4  Numerical  results 

To demonstrate the improvement realizable through use of multiple shift polynomials over single shift polynomials, we perform experiments with two large graph Laplacian matrices for which the sparse factorization is reasonably inexpensive. We compare the rates of convergence over many dimensions. Multiple shift polynomials are similar to inner iteration preconditioning in that both use polynomials of order 2 or greater; we therefore also compare the convergence of multiple shift polynomials against inner power iteration with a single shift, to separate the effects of simply using a higher-order polynomial from the effects of using judiciously chosen multiple shifts. We also consider the convergence of eigenvalues in Jacobi-Davidson spaces for a further point of reference.

We compared the convergence of eigenvalues for the Colorado road network Laplacian matrix, for both the normalized and combinatorial Laplacians. The trailing 64 eigenvalues and the leading eigenvalue of both matrices are shown in Figure 8.2. We remark that both matrices exhibit tight clustering of the trailing eigenvalues. This, combined with the relatively large leading eigenvalues, leads to terribly slow convergence
of  trailing  eigenvalues.  The  values  of  7  from  (2.3)  are  7  =  5.17  x  10“^  for  the  normalized  Laplacian 
and  7  =  2.59  x  10“^  for  the  combinatorial  Laplacian.  Even  for  large  subspaces,  the  convergence  of 
these  trailing  eigenvalues  is  limited;  the  denominator  term  in  (2.3)  for  the  normalized  Laplacian  is 
Figure 8.2: Trailing 64 eigenvalues and leading eigenvalue (circled) for normalized graph Laplacian (left) and combinatorial graph Laplacian matrices (right) for the Colorado road network graph.

less  than  5  for  a  subspace  dimension  of  1,000.  Therefore  preconditioning  is  needed. 

These matrices are always positive semi-definite with exactly one null eigenvector, and their structure is planar and sparse. We have observed that the sparse LU factorization may be computed on the order of seconds using the SuperLU library [47]. These matrices make good candidates for
application  of  multiple  polynomial  shifts.  For  comparison  of  the  methods  discussed  in  the  preced¬ 
ing  section,  we  first  performed  single-vector  Lanczos  iteration  to  produce  a  low-rank  approximation 
of  L  restricted  to  JF'i(L''',xo);  this  corresponds  to  simple  shift  preconditioning  for  s  =  0.  We  also 
performed  baseline  runs  with  Jacobi-Davidson  using  a  deflation  tolerance  of  t  =  1  x  10“^^.  For  eval¬ 
uation  of  multiple-shift  polynomials,  we  ran  Algorithm  13  with  a  multi-shift  parameter  m  =  4.  We 
generated  subspaces  of  dimension  up  to  64,  and  compared  the  nuclear  norms  of  the  approximation 
to the approximation generated by the truncated spectral decomposition, which coincides with the truncated SVD as the input matrix is positive semi-definite. Figure 8.3 shows the low-rank approximation norms for the normalized Laplacian, and Figure 8.4 shows the low-rank approximation
norms  for  the  combinatorial  Laplacian.  The  multi-shift  Krylov  subspace  projections  produce  bet¬ 
ter  low-rank  approximations  than  either  Jacobi-Davidson  or  single-shift  Krylov  subspaces  for  s  =  0 
or  s  =  L^_j(?i(A^*^)/^  with  e  spanlAXo).  It  is  important  that  the  multi-shift  Krylov  subspaces 
require  multiple  solutions  of  a  sparse  factored  LU  matrix  per  dimension,  rather  than  one  solution 
for  the  single  shift  subspaces.  This  presents  an  additional  computational  expense;  in  this  way  the 
Figure  8.3:  Errors  of  low-rank  approximations  of  normalized  Laplacian  for  Colorado  road  network  graph: 
Jacobi-Davidson  v.  multi-shift  Krylov  (upper-left),  single  shift  with  shift  from  preliminary  random  projections 
approximation  v.  multi-shift  (upper-right),  and  unshifted  pseudoinverse  v.  multi-shift  (bottom). 
Figure 8.4: Errors of low-rank approximations of combinatorial Laplacian for Colorado road network graph:
Jacobi-Davidson  v.  multi-shift  Krylov  (upper-left),  single  shift  with  shift  from  preliminary  random  projections 
approximation  v.  multi-shift  (upper-right),  and  unshifted  pseudoinverse  v.  multi-shift  (bottom). 

Figure 8.5: Comparison of low-rank approximation norms between inner-iteration preconditioning on $A^{-1}$ with 4 inner iterations and multi-shift preconditioning with a preconditioning factor of $m = 4$. The results for the combinatorial Laplacian of the Colorado road network are on the left and those for the normalized Laplacian are on the right.

method of using multiple shifts may be compared to the inner power iteration method. The multi-shift method derives some benefit from the shifts themselves, but also obtains some benefit from the inner-iteration-like use of a higher-order matrix polynomial to better drive convergence of leading eigenvalues. For example, one could use the sparse LU factorization combined with inner iterations to apply a power of $A^{-1}$ instead of using Algorithm 13 and $m$ shifts. One may also apply inner iteration preconditioning to the singly-shifted Krylov subspace $\mathcal{K}_k((A - \mu I)^{-1}, x_0)$ for $\mu = \sum_{i=1}^{k} \theta_i(A^{(k)})/k$ from an initial random projection eigenvalue approximation. The difference between the two methods depends on how well the multiple shifts accelerate convergence compared to simply using a higher order matrix polynomial. Compared to inner iteration preconditioning, the multiple shift method does not have an advantage as distinct as its advantage over the single-shift and Jacobi-Davidson methods; it produces only slightly better low-rank approximations, mainly for intermediate dimensions. Figure 8.5 shows the comparison between inner power iteration preconditioning and multi-shift iteration. These results suggest that a non-negligible share of the good approximation generated by the multi-shift methods is due to the inner-iteration effects expanding local gaps in the leading spectrum of $A^{-1}$. However, for intermediate dimensions, multiple-shift preconditioning still produces slightly better approximations, which may be significant when memory resources are limited.
8.5 Conclusion

Interesting classes of problems require low-rank approximation of a matrix using its eigenpairs with smallest rather than largest eigenvalues. For these matrices, Krylov subspaces or random projections fail, as the latter are intended only to approximate leading eigenvalues, and the former encounter difficulty when the input matrix has tightly-clustered eigenvalues. In these instances, Krylov subspaces can still give good convergence if the input problem can be preconditioned with a shift and matrix inversion. Simple matrix inversions yield good results for positive definite problems, but the convergence can be accelerated by using multiple shifts that are close to eigenvalues of interest.

In comparison with the combination of inner iteration preconditioning with shift-and-invert preconditioning with one shift, multiple shifts have some, albeit limited, advantages. When shift-and-invert with a single shift produces adequate approximations, the addition of inner iteration will improve these to good results, and multiple shifts will lead to a marginal further improvement. Multiple shifts will be most appropriate when each solution of the factored LU system is of moderate or greater cost. Such a cost limits the number of inner iterations that are affordable, and the advantages of multiple shifts over inner iteration will then be most pronounced.

One  unexpected  result  we  witness  in  these  experiments  is  the  comparatively  poor  approxima¬ 
tion  properties  of  the  Jacobi-Davidson  method  for  generating  low-rank  approximations  in  minimal 
iterations.  In  every  case,  a  shift-and-invert  transformation  applied  to  the  Lanczos  process  produced 
a  better  low-rank  approximation  error  than  the  Jacobi-Davidson  method.  This  suggests  that  shift- 
and-invert  Lanczos  methods  are  preferable  to  Jacobi-Davidson  for  minimal  low-rank  approximation 
when  matrix  inversions  are  not  expensive,  and  future  efforts  could  generalize  these  results  to  inexact 
Krylov  subspaces. 
Low-rank approximation using Krylov
subspaces  for  fast  convolution 


Low-rank matrix approximation provides opportunities for dramatic improvements in compute time in some instances. One such instance is convolution, which may be used for signal filtering or model reduction. For example, simply filtering $n$ signals requires $n$ convolutions, and evaluation of $k$ frequencies of the pseudovelocity shock response spectrum [28] on $n$ input signals requires $nk$ convolutions. Low-rank approximation allows for a reduction in work corresponding to the reduction in rank of the input data. The discrete convolution of two functions $f$ and $g$

(9.1)  $(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]$

may be obtained more simply in the frequency domain as

(9.2)  $(f * g)[n] = \mathcal{F}^{-1}\{\mathcal{F}\{f\} \cdot \mathcal{F}\{g\}\}[n].$

As the Fourier transform is itself an orthonormal projection, we may apply basis expansion $\mathcal{F}\{g\} = \sum_{i=1}^{r} U_i c_i$ for $r$ basis vectors $U_i$ in the frequency domain and reconstruct the convolution $(f * g)$ using the convolved basis vectors $(f * u_i)$. This may drastically reduce the number of convolutions necessary when presented with a large number of input vectors that are low rank. The sole requirement on the basis vectors $U_i$ is that they be linearly independent; one may use singular vectors from a proper orthogonal decomposition [13, 76], an orthogonal basis set from random projections or a basis set for a Krylov subspace. Therefore, this problem type allows evaluation of the random projection method against the hybrid block Krylov subspace method, GrABL and inner power iteration for low-rank approximation.
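Because convolution is linear, convolving the basis vectors and recombining them with the expansion coefficients reproduces the convolution of every signal. The sketch below illustrates this on synthetic data; the matrix sizes, the random filter kernel, and the use of a thin SVD to obtain the basis are assumptions made for the example, and any linearly independent basis of the column space would serve.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_signals, rank = 256, 40, 3

# Hypothetical low-rank data: every signal is a mixture of `rank` sources.
sources = rng.standard_normal((n_samples, rank))
A = sources @ rng.standard_normal((rank, n_signals))
g = rng.standard_normal(16)  # filter kernel (assumed for illustration)

# Orthonormal basis of the column space; a thin SVD is used for simplicity.
U, singular_values, Vt = np.linalg.svd(A, full_matrices=False)
U_r = U[:, :rank]
C = U_r.T @ A  # expansion coefficients, one column per signal

# Convolve only the `rank` basis vectors, then recombine by linearity.
conv_basis = np.column_stack([np.convolve(U_r[:, i], g) for i in range(rank)])
fast = conv_basis @ C

# Reference: convolve all signals directly (n_signals convolutions).
direct = np.column_stack([np.convolve(A[:, j], g) for j in range(n_signals)])

max_diff = np.max(np.abs(fast - direct))
print(f"{rank} convolutions instead of {n_signals}; max difference {max_diff:.2e}")
```

Here the data are exactly rank 3, so the two results agree to rounding error; for data that are only approximately low rank, the difference is governed by the approximation error of the truncated basis.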

We  consider  an  application  of  low-rank  approximation  to  a  set  of  1-dimensional  input  acceler¬ 
ation  time  histories.  These  time  histories  were  obtained  from  the  response  of  two  1-DOF  linear 
spring-mass-damper  systems  to  random  impulsive  loading.  These  systems  were  meant  to  represent 
simple  physical  structures  and  model  their  response  to  shock  loading.  As  only  two  separate  systems 
were  used  to  generate  the  data,  the  rank  of  the  output  was  much  lower  than  the  actual  number  of 
signals  contained.  We  consider  the  error  in  low-rank  approximation  of  the  various  methods  devel¬ 
oped  previously  and  its  impact  on  theoretical  error  bounds  on  the  convolution  output.  The  resulting 
data matrix has 131,072 columns and 2,000,000 rows, with 497,197,284 nonzero elements. Solving
the eigenproblem on $A^T A$ will result in sparse matrix-vector products that are far more expensive
than  dense  linear  algebra  operations.  Therefore,  low-rank  approximation  methods  that  minimize 
sparse  matrix-vector  products  will  result  in  the  most  meaningful  compute  cost  savings  when  only 
the  low-rank  approximation  A^^^  to  A  need  be  computed.  When  a  orthonormal  projection  matrix  Q 
with  Q^A  =  is  required,  dense  matrix  operations  are  a  significant  cost,  and  the  reduction  in 
complexity  due  to  block  size  shrinkage  may  be  expected  to  result  in  compute  savings  for  the  shrink- 
and-iterate,  hybrid  random  projections-block  Krylov  and  GrABL  methods. 

The methods described here bear similarity to Krylov methods for model reduction and solution of ODEs, though there are distinctions between them. For ODE solution, one may solve

(9.3)  $\dot{y}(t) = A y(t)$

using the matrix exponential with initial condition $y(0) = y_0$. Krylov subspace projections may be used to approximate the matrix exponential operator with a lower-rank approximation restricted to a Krylov subspace [38, 68]. Model reduction applications also use subspace projections for approximating solutions to differential equations of the form

(9.4)  $E \dot{x}(t) = A x(t) + B u(t)$

(9.5)  $y(t) = C^{T} x(t).$

These may be solved with convolution of the system's transfer function with the input vector as $y(t) = (h * u)(t)$, which is accomplished in the frequency domain as $Y(s) = U(s) \cdot H(s)$ with

(9.6)  $H(s) = D + C^{T} (A - Es)^{-1} B.$

Krylov subspace approximations to the matrix in (9.6) representing the transfer function may be used to approximate the system response at a single frequency $s$ or a set of complex frequencies [27, 34, 35]. Other model reduction methods for nonlinear systems

(9.7)  $\dot{x}(t) = f(x(t), u(t))$

are based on proper orthogonal decomposition of the inputs $u(t)$ and outputs $y(t)$ empirically obtained from model runs [2] that may be used to project the high-dimensional state $x(t)$ into a low-dimensional subspace and produce a low-rank approximation $\hat{x}(t)$. In all these cases, the model, be it $H(s)$ or $x(t)$, is approximated in a low-dimensional subspace. Thus, the complexity of the model is reduced. In this application of low-rank approximation, we are reducing the dimension of the inputs rather than the model, which is allowed to remain high-dimensional. It would correspond to approximating $y_0$ or $U(s)$ for the ODE or model reduction problems rather than the operators that describe the state of the system. The reduction in compute cost comes not from deriving a model approximation that is easier to evaluate, but from finding a low-rank space that characterizes all of the possible inputs. The basis vectors of that low-rank input space may then be run through the original, high-dimensional model, and the results used to reconstruct the outputs $y(t)$. This enables reduction of compute costs for cases in which evaluation of the transfer function $H(s)$ may not be expensive, but must be performed a great many times.
9.1 Low-rank approximation error and convolution

If one approximates an input signal $f$ with a low-rank approximation $\hat{f}$ such that $f = \hat{f} + e$ for some error $e$, then the low-rank approximation error will influence the error of operations on $\hat{f}$. In the case of convolution $(f * g)$, the error $\|(f * g) - (\hat{f} * g)\|$ depends on the approximation error $e$ of $f$ and on $g$. The formal results are summarized in the following theorem.

Theorem 12. Let $f$ and $g$ be two arbitrary vectors in $\mathbb{R}^n$. Let $\hat{f}$ be a low-rank approximation of $f$ with $\hat{f} + e = f$ such that $\hat{f} \perp e$. Let $\epsilon = \|e\|^2$. Then the error in the discrete convolution $\|(f * g) - (\hat{f} * g)\|^2$ is bounded as

(9.8)  $\|(f * g) - (\hat{f} * g)\|^2 \le G_{\max}^2\, \epsilon$

where $G_{\max} = \max_i \{|\mathcal{F}\{g\}_i|\}$ is the maximum-magnitude element of the frequency domain transform of $g$.

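Proof. Since the frequency domain transform is unitary, $\|x\| = \|\mathcal{F}\{x\}\|$ for any $x$. Let $F = \mathcal{F}\{f\}$, $\hat{F} = \mathcal{F}\{\hat{f}\}$, $E = \mathcal{F}\{e\}$ and $G = \mathcal{F}\{g\}$. Note that $F = \hat{F} + E$, as a frequency transform may also be expressed as a change of basis.

(9.9)  $\|(f * g) - (\hat{f} * g)\|^2 = \|(F \cdot G) - (\hat{F} \cdot G)\|^2 = \sum_{i=1}^{n} (F_i G_i - \hat{F}_i G_i)^2.$

Since $E = F - \hat{F}$, any individual entry $E_i$ of $E$ has

(9.10)  $E_i^2 = F_i^2 + \hat{F}_i^2 - 2 F_i \hat{F}_i.$

Then

(9.11)  $\sum_{i=1}^{n} (F_i G_i - \hat{F}_i G_i)^2 = \sum_{i=1}^{n} \left( G_i^2 F_i^2 + G_i^2 \hat{F}_i^2 - 2 G_i^2 F_i \hat{F}_i \right) = \sum_{i=1}^{n} G_i^2 E_i^2.$

If $G_{\max} = \max_i \{|\mathcal{F}\{g\}_i|\}$, then

(9.12)  $\sum_{i=1}^{n} G_i^2 E_i^2 \le G_{\max}^2 \sum_{i=1}^{n} E_i^2 = G_{\max}^2 \|E\|^2 = G_{\max}^2\, \epsilon$

as the frequency transform is unitary and $\|e\|^2 = \|E\|^2 = \epsilon$.

□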
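The bound of Theorem 12 can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the original analysis: it uses circular convolution computed with NumPy's FFT, takes $G$ to be the unnormalized discrete Fourier transform of $g$ (that is, the frequency response of the filter), and builds the low-rank approximation by orthogonal projection onto a random 8-dimensional subspace. The constant in the bound depends on the transform normalization, so other conventions would rescale $G_{\max}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 512
f = rng.standard_normal(n)
g = rng.standard_normal(n) / n  # arbitrary filter (assumed for illustration)

# Orthogonal projection of f onto a random 8-dimensional subspace.
Q, _ = np.linalg.qr(rng.standard_normal((n, 8)))
f_hat = Q @ (Q.T @ f)
e = f - f_hat  # orthogonal to f_hat by construction


def circular_convolution(x, y):
    """Circular convolution via the convolution theorem."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))


# Left-hand side: squared error of the convolved approximation.
lhs = np.sum((circular_convolution(f, g) - circular_convolution(f_hat, g)) ** 2)

# Right-hand side: G_max^2 * epsilon, with epsilon = ||e||^2.
G_max = np.max(np.abs(np.fft.fft(g)))
epsilon = np.sum(e ** 2)
rhs = G_max ** 2 * epsilon

print(f"convolved error {lhs:.4e} <= bound {rhs:.4e}")
```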
One may note that the squared Frobenius error of the low-rank approximation of $A$ with $A^{(k)}$ is the aggregate $\|e\|^2$ error over all columns of $A$. Thus $\|A - A^{(k)}\|_F^2$ does bound any $\|e\|^2$ from above; when an orthogonal projection is used to produce $A^{(k)}$, the errors are orthogonal to the low-rank approximations, and columns $a_i$ of $A$ have approximation error

(9.13)  $\|e_i\|^2 = \|a_i\|^2 - \|\hat{a}_i\|^2.$

Thus we may simply measure the norms of the original data $\|a_i\|$ and their corresponding low-rank approximations to easily calculate $\|e_i\|$. Thus, it is an easy matter to apply the bounds from Theorem 12 a posteriori to determine the maximum possible error.
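Because the error is orthogonal to the approximation under an orthogonal projection, the squared column norms satisfy a Pythagorean relation, so the per-column error follows from norms alone. The sketch below illustrates this on a small random matrix; the sizes and the random projection basis are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((300, 50))

# Orthonormal basis Q of a hypothetical 10-dimensional approximation subspace.
Q, _ = np.linalg.qr(rng.standard_normal((300, 10)))
coeffs = Q.T @ A  # projected columns; ||Q c|| = ||c|| since Q is orthonormal

# A posteriori column errors from norms only: ||e_i||^2 = ||a_i||^2 - ||a_hat_i||^2.
col_norms_sq = np.sum(A ** 2, axis=0)
approx_norms_sq = np.sum(coeffs ** 2, axis=0)
err_sq = col_norms_sq - approx_norms_sq

# Explicit check, which requires forming the full approximation.
explicit_err_sq = np.sum((A - Q @ coeffs) ** 2, axis=0)
frobenius_sq = np.sum(explicit_err_sq)

# Largest relative column error.
max_relative_error = np.max(err_sq / col_norms_sq)
print(f"max relative column error: {max_relative_error:.3f}")
```

The norm-based estimate avoids forming $A - A^{(k)}$ explicitly, and each column error is bounded above by the aggregate squared Frobenius error.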

9.2  Numerical  results 

To generate numerical results, we generated data using two 1 degree-of-freedom linear spring-mass-damper systems representing stiff mechanical structures. The systems were subjected to randomly determined impulsive loading for an initial period of 10 milliseconds, and then subjected to a second 50 millisecond period of impulsive loading at 1 second. The systems were run for 2 seconds, at which point they had reached equilibrium. The sampling rate was 2 MHz. $2^{17}$ runs were performed; $2^{16}$ for each system. Sparse random noise was added to the outputs. The resulting data matrix had $2 \times 10^{6}$ rows and $2^{17}$ columns; the matrix had 497,197,284 nonzero elements. We did not center the columns of the matrix. The leading 64 elements of the spectrum of $A^T A$ are shown in Figure 9.1. We compare the results of the proper orthogonal decomposition against low-rank approximation using random projections, GrABL, the hybrid block Krylov-random projections method with 50% block size, single-vector Lanczos with inner power iteration with p = 2 and GrABL with p = 0.001, a deflation window of 3 and p = 2. These parameters resulted in an initial block size of 4. All methods were used to generate low-rank approximations of $k = 2^i$ dimensions for $i = 1, 2, \ldots, 6$. The low-rank approximation error indicates the maximum convolution error, and we present a simple example using a low-pass Butterworth filter. We also consider the compute costs of random projections and the Krylov subspace methods developed previously herein.
Figure 9.1: Leading 64 eigenvalues of $A^T A$.


We first consider the squared Frobenius norm of the approximation error; this is indicative of the aggregate squared Euclidean error over all columns of the input matrix $A$. The squared Frobenius norm of the approximation error $\|A - A^{(k)}\|_F^2$ for $A^{(k)}$ as computed by PCA (equivalent to POD), GrABL with p = 2, inner iterations with p = 2, random projections and the hybrid random projections-block
Krylov  method  operating  in  Jt^lAjA^Ao)  are  shown  in  Figure  9.2.  Even  though  there  are  over 
100,000 signals, it is possible to reconstruct all of them in a 64-dimensional subspace with Frobenius norm error of less than 1% for all methods. The SVD produces the smallest Frobenius norm error by definition, and the hybrid method is the relaxed method with best error at all dimensions. As mentioned previously, Frobenius norm error is combined error over all columns of $A - A^{(k)}$. Since columns of $A$ are individual signals, the approximation error of any one individual signal $\|a_i - \hat{a}_i\|^2 = \|e_i\|^2$ is bounded by $\|A - A^{(k)}\|_F^2$, but this is likely a pessimistic bound. Equation (9.13) gives a method to easily compute the individual error a posteriori. Figure 9.3 shows the maximum individual error magnitude $(\max_i \|e_i\|^2)/\|a_i\|^2$; this is the percent error for the largest individual approximation error.

Theorem 12 shows the relationship between approximation error $\|a_i - \hat{a}_i\|^2$ and the convolved approximation error $\|(a_i * g) - (\hat{a}_i * g)\|^2$ for some appropriate $g$. We consider two arbitrarily-chosen signals, one from each of the models run to produce columns of $A$. We compare the approximation
Figure 9.2: Squared Frobenius norm of approximation error of $A$ with $A^{(k)}$ for PCA, GrABL, random projections, inner power iterations with p = 2, and the hybrid random projections-block Lanczos method.

Figure 9.3: Maximum column approximation error as a percent for PCA, GrABL and random projections (left), and PCA, inner power iterations with p = 2, and the hybrid random projections-block Lanczos method (right).

Figure 9.4: Original and approximated example signals from the stiffer of the two models.

Figure 9.5: Original and approximated example signals from the softer of the two models.


norm $\|a_i - \hat{a}_i\|$ for each signal, as well as the convolved approximation error $\|(a_i * g) - (\hat{a}_i * g)\|^2$ for a second-order low-pass Butterworth filter with a cutoff frequency of 1 kHz. Figure 9.4 shows the original signals from the first model, and Figure 9.5 shows the original signals from the second model. Figures 9.6 and 9.7 show the filtered signals from the first and second model, respectively. Only the first 4000 samples are shown. A key characteristic of Butterworth filters is that they are flat in the passband; the filter used for these examples has a gain of no more than 1 for any frequency. Therefore, Theorem 12 predicts that all filtered approximation errors will be no greater than unfiltered approximation errors. For the figures shown, this indeed holds; the approximation and filtered approximation errors are shown in Table 9.1.
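The claim that the filter gain never exceeds 1 can be illustrated with the textbook magnitude response of an analog low-pass Butterworth filter, $|H(f)| = 1/\sqrt{1 + (f/f_c)^{2n}}$. The sketch below assumes this ideal analog prototype; the discrete-time implementation actually used for the experiments is not specified in the text, and the sample frequencies are arbitrary.

```python
import math


def butterworth_gain(freq_hz, cutoff_hz, order):
    """Magnitude response of an ideal analog low-pass Butterworth filter."""
    return 1.0 / math.sqrt(1.0 + (freq_hz / cutoff_hz) ** (2 * order))


cutoff_hz, order = 1000.0, 2  # second-order filter, 1 kHz cutoff
freqs = [0.0, 100.0, 1000.0, 10000.0, 1.0e6]
gains = [butterworth_gain(fr, cutoff_hz, order) for fr in freqs]

# G_max is attained at DC and equals 1, so the factor G_max^2 in the
# convolution error bound does not amplify the approximation error.
g_max = max(gains)

for fr, gain in zip(freqs, gains):
    print(f"{fr:>10.1f} Hz  gain {gain:.6f}")
```

Since the gain is monotonically decreasing from 1 at zero frequency, $G_{\max} = 1$ and the bound of Theorem 12 reduces to the unfiltered approximation error.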

Approximation  accuracy  is  not  the  only  consideration  for  this  example;  the  resulting  matrix  has 
a  large  number  of  nonzero  elements,  especially  in  comparison  to  the  dimension  of  the  Gram  matrix 


Figure 9.6: Original and approximated example signals from the stiffer model filtered with Butterworth low-pass filter.

Figure 9.7: Original and approximated example signals from the softer model filtered with Butterworth low-pass filter.


                         SVD             random projections   GrABL           inner iterations   hybrid method
model 1 error            1.08 X 10“'^    1.91 X 10“^          9.32 X 10“^     1.67x10“^          1.16 X 10“^
model 1 filtered error   1.33 X 10“^     5.12 X 10“^          1.01 X 10“^     4.1 X 10“^         2.24 X 10“^
model 2 error            3.58 X 10“'^    1.34 X 10“^          8.16 X 10“^     7.38x10“^          5.23 X 10“^
model 2 filtered error   1.85 X 10“^     6.07 X 10“^          7.11 X lO""^    1.89x10“^          7.45 X 10“^

Table 9.1: Convolved and unconvolved errors for the two example signals.


$A^T A$ used as input to the various algorithms here. The matrix $A$ has nearly a half billion nonzero elements, but $A^T A$ has 3 orders of magnitude fewer dimensions. Compute costs for developing the low-rank approximations here are non-trivial, and due to the structure of the problem, sparse matrix-vector products may be expected to be the predominant driver of compute costs. Figure 9.8 shows the theoretical FLOP counts for the various algorithms; Figure 9.9 shows the FLOP counts normalized to those required by the random projection algorithm. The random projection method is not the most expensive for small dimensions; this is due to the larger number of matrix-vector products in GrABL during its initial refinement phase. However, for larger dimensions, GrABL obtains an advantage. Inner iterations and the hybrid method require fewer dense floating-point operations than the random projection method, but require the same number of sparse operations; for this problem, their compute performance should be expected to be roughly equivalent to the random projection method. Only GrABL may be expected to be faster.

9.3  Discussion  and  Conclusion 

Overall  the  trends  developed  elsewhere  in  previous  chapters  are  also  apparent  in  these  observa¬ 
tions.  The  leading  singular  values  of  A  (and  eigenvalues  of  A^A)  are  well-separated,  and  become 
more  clustered  to  the  interior.  These  are  the  conditions  that  predict  the  hybrid  method  generating 

Figure  9.8:  Sparse  (left)  and  dense  (right)  FLOP  counts  of  random  projections,  GrABL,  inner  iterations  and 
the  hybrid  method. 


Figure  9.9:  Sparse  (left)  and  dense  (right)  ratios  of  GrABL,  inner  iterations  and  the  hybrid  method  to  random 
projections. 
approximations will produce smaller approximation errors than random projections.

The  random  projection  method  itself  may  produce  smaller  approximation  errors  than  a  single-vector 
Krylov  subspace  method,  but  at  the  cost  of  extra  sparse  matrix-vector  products.  When  the  hybrid 
method  produces  smaller  errors  than  random  projections,  then  it  is  implied  that  it  produces  smaller 
errors  than  a  minimal  single-vector  Krylov  subspace  as  well. 

The rapidly-encountered singular value clustering also implies that GrABL will use a small initial block, and will obtain compute-cost advantages over random projections. Due to the large number of nonzero elements in the matrix, the matrix-vector products $Ax$ are expensive, and dominate the compute costs. The advantages from block size shrinkage for the shrink-and-iterate, the hybrid random projections-block Krylov method and GrABL are irrelevant. Instead, minimization of matrix-vector multiplications is most important. In such cases, the convergence advantages of methods incorporating some stationary iteration may be erased by simply generating a non-minimal Krylov subspace. However, it is important to note that sparse matrix-vector products are rather amenable to parallelization, whereas some of the dense matrix operations used in random projections and block Krylov methods, such as QR factorizations, are not as amenable to parallelization. Nevertheless, in instances such as these, GrABL should be expected to present the principal compute advantages at larger dimensions. Inner power iteration, shrink-and-iterate and the hybrid method may still produce smaller errors, but compute costs should be expected to be similar to random projections. Finally, the compute costs considered here are only in terms of finding approximations to
singular  values  of  A.  If  Q^A^AQ  =  A**^  is  the  restriction  of  A^A  to  span{Q},  then  estimation  of  a 
basis  P  for  the  left  singular  vector  space  of  A  may  be  accomplished  with 

(9.14)  $P = A Q \left( \Lambda^{(k)} \right)^{-1/2},$

but this operation requires $k$ additional sparse matrix-vector products. A solution would be to
use Golub-Kahan bidiagonalization on $A$ rather than the Lanczos algorithm on $A^T A$. The
asymptotic complexity of this method would be at least as bad as that of the Lanczos method on
$A A^T$, and the compute cost advantages of the shrink-and-iterate, hybrid and GrABL methods are
again relevant.
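As a minimal sketch of the left singular vector recovery described above, the following NumPy fragment performs a Rayleigh-Ritz step on $A^T A$ over span{$Q$} and then maps the Ritz vectors to the left side. It rests on several assumptions: the function name `left_singular_basis` is hypothetical, the small projected matrix is formed explicitly from $AQ$ rather than read off a Lanczos tridiagonal, and the columns are scaled by the inverse square roots of the Ritz values, which is the normalization under which $P$ has orthonormal columns.

```python
import numpy as np

def left_singular_basis(A, Q):
    """Rayleigh-Ritz on A^T A over span{Q}, then map to the left side.

    A is m x n; Q is assumed to have k orthonormal columns of length n.
    """
    AQ = A @ Q                          # the k additional products with A
    G = AQ.T @ AQ                       # Q^T A^T A Q, a k x k matrix
    lam, W = np.linalg.eigh(G)          # eigenvalues in ascending order
    lam, W = lam[::-1], W[:, ::-1]      # reorder so the largest comes first
    sigma = np.sqrt(np.clip(lam, 0.0, None))   # singular value estimates
    V = Q @ W                           # Ritz vectors on the right side
    P = (AQ @ W) / sigma                # scale column i by 1 / sigma_i
    return sigma, P, V

# Illustrative data only: a random dense matrix and a 5-dimensional subspace.
rng = np.random.default_rng(0)
A = rng.standard_normal((60, 40))
Q, _ = np.linalg.qr(A.T @ rng.standard_normal((60, 5)))
sigma, P, V = left_singular_basis(A, Q)
print("departure of P from orthonormality:", np.linalg.norm(P.T @ P - np.eye(5)))
```

The division by `sigma` assumes that no Ritz value is zero; a production routine would need to guard against that case.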

We have applied random projections, GrABL, single-vector Lanczos and the hybrid random
projections/block Krylov subspace methods to the low-rank approximation problem as it arises in
fast signal convolution. All of these methods are intended to be relaxed alternatives to the
truncated singular value decomposition, which may require substantial compute time. Substituting a
relaxed alternative for the truncated singular value decomposition allows for possibly significant
compute savings.

There are various methods that may serve as surrogates for a truncated singular vector space.
Single-vector Krylov subspaces have been proposed by several authors [10, 14, 15, 75], but block
Krylov subspaces have not been considered. Block methods offer faster convergence of Krylov
subspaces to singular vector spaces, and can better exploit local singular value separation. The
random projection methods proposed in [36] suggest consideration of block Krylov subspaces. Random
projections and single-vector Krylov subspaces both produce compelling results, but the two methods
provide distinct advantages. Random projections are successful at excluding singular vectors
corresponding to small singular values from their subspace, but require $2k$ matrix multiplications
to produce a $k$-dimensional subspace. Random projections also require $O(nk^2)$ FLOPS to produce a
$k$-dimensional subspace. Single-vector Krylov subspaces require only $k$ matrix multiplications to
produce a $k$-dimensional subspace, but may include singular vectors corresponding to small
singular values in their subspace; these vectors are understood as noisy elements of the data.
Krylov subspaces are also plagued by loss of orthogonality with continued iteration, and require
reorthogonalization measures which random projections do not. In terms of cost, single-vector
Lanczos methods produce the best low-rank approximations, but random projections may produce the
best in terms of error. Composite methods combining some stationary power iteration with Krylov
subspace iteration may produce errors that are smaller than those of single-vector Lanczos and
possibly smaller than those of random projections, with compute times that are no worse than random
projections, while still requiring less stabilization than ordinary Lanczos.
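The cost accounting for random projections can be made concrete with a short sketch. It assumes that the $2k$ matrix multiplications are $k$ products with $A$ followed by $k$ products with $A^T$, applied to a Gaussian test block with no oversampling; the function name `random_projection_basis` and the synthetic low-rank-plus-noise matrix are illustrative choices, not taken from the experiments reported here.

```python
import numpy as np

def random_projection_basis(A, k, rng):
    """Orthonormal basis for an approximate leading right singular space."""
    n = A.shape[1]
    Omega = rng.standard_normal((n, k))   # Gaussian test block
    Y = A.T @ (A @ Omega)                 # k products with A, then k with A^T
    Q, _ = np.linalg.qr(Y)                # dense QR, O(n k^2) FLOPs
    return Q

# Synthetic test matrix: rank 5 plus small dense noise (assumed for illustration).
rng = np.random.default_rng(2)
m, n, k = 80, 60, 5
U, _ = np.linalg.qr(rng.standard_normal((m, k)))
V, _ = np.linalg.qr(rng.standard_normal((n, k)))
s = np.array([10.0, 8.0, 6.0, 4.0, 2.0])
A = (U * s) @ V.T + 1e-4 * rng.standard_normal((m, n))

Q = random_projection_basis(A, k, rng)
err = np.linalg.norm(A - (A @ Q) @ Q.T, 2)       # error of the projection A Q Q^T
opt = np.linalg.svd(A, compute_uv=False)[k]      # error of the truncated SVD
print("random projection error:", err, " optimal rank-k error:", opt)
```

Because the noise singular values are far below the fifth singular value, the projected approximation in this example should sit close to the optimal rank-$k$ error; matrices with slowly decaying spectra would show a larger gap.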

Conclusion


This dissertation has examined acceleration methods for minimal and near-minimal Krylov subspaces
for use in generating low-rank matrix approximations. Krylov subspace projections present a
relaxation of the Frobenius norm-optimal low-rank approximation, but this relaxation may not
sacrifice much in terms of approximation error and can still yield significant savings in
computational time. Existing uses of Krylov subspaces for low-rank approximation are intended to
present the most compute-time savings over the non-relaxed, optimal solutions. Our methods are
intended to produce approximations with smaller error than the least expensive Krylov subspaces
examined to date, while remaining faster than the fully-converged, norm-optimal truncated SVD or
spectral decomposition approximations. Many of the methods examined herein are based on the
performance of random projection methods, which are also intended for use as truncated SVD or
spectral decomposition alternatives. Low-rank approximations arise either in estimating the
eigenpairs or singular triplets of a matrix, or in reducing the dimension of an input matrix that
is inherently low-rank but is embedded in a high-dimensional space. Such low-rank problems embedded
in high-dimensional space arise frequently in engineering, network science, machine learning and
many other fields.

Low-rank approximation may be optimally solved with truncated matrix factorizations; the spectral
decomposition is truncated when the input matrix is square and positive semi-definite, and the
singular value decomposition is truncated when the input matrix is not square or is indefinite.
These factorizations produce the best rank-$k$ approximation $A^{(k)}$ to the input matrix $A$ in
terms of the spectral and Frobenius norms, as the residual satisfies

(10.1)  $\| A^{(k)} - A \| \geq \| A^{(k)} \| - \| A \|$

with equality when the singular vector spaces of $A$ and $A^{(k)}$ coincide. Thus, the minimum value
for (10.1) is achieved by $A^{(k)} = U_k \Sigma_k V_k^T$ or $A^{(k)} = U_k \Lambda_k U_k^T$, where
$U_k \Sigma_k V_k^T$ is the truncated SVD and $U_k \Lambda_k U_k^T$ is the spectral decomposition,
in either case using the $k$ eigenpairs or singular triplets with largest magnitude. Regrettably,
computing $k$ eigenpairs or singular triplets may be expensive, even when $k$ is small relative to
the dimension of $A$ and the computation is performed iteratively. Truncated matrix factorizations
may be slow to converge for a number of reasons.
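The optimality of the truncated SVD can be checked numerically on a small example. The sketch below uses a random dense matrix (an illustrative assumption) and compares the truncated SVD residual with the residual of an arbitrary rank-$k$ orthogonal projection; the residual of the truncated SVD equals the $(k+1)$-th singular value in the spectral norm and the root sum of squares of the trailing singular values in the Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 30))
k = 5

# Truncated SVD: keep the k singular triplets of largest magnitude.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

spec_err = np.linalg.norm(A - A_k, 2)        # spectral-norm residual
frob_err = np.linalg.norm(A - A_k, "fro")    # Frobenius-norm residual

# An arbitrary rank-k competitor: project onto a random k-dimensional subspace.
Qr, _ = np.linalg.qr(rng.standard_normal((30, k)))
other_err = np.linalg.norm(A - (A @ Qr) @ Qr.T, 2)

print("truncated SVD:", spec_err, " sigma_{k+1}:", s[k])
print("random rank-k projection:", other_err)
```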

Krylov subspaces are an alternative to the use of an eigenvector or singular vector space. Krylov
subspaces are defined in terms of the span of a start vector or block vector and a matrix;
therefore, they may be generated using only matrix-vector products and an orthonormalization
routine. When the input matrix is Hermitian, a full Gram-Schmidt process is not required and a
Krylov subspace of dimension $k$ may be generated for an $n \times n$ input matrix with asymptotic
cost of $O(nk)$. Even a short Krylov subspace will produce good approximations of the leading
eigenvalues used in the truncated SVD or spectral decomposition when those eigenvalues are
well-separated from the remainder of the spectrum, as is typical of a low-rank problem embedded in
a high-dimensional space. The fast convergence of precisely those eigenvalues or singular values
that matter most, combined with the good computational complexity of Krylov subspace generation,
makes their use attractive.
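The short recurrence available for Hermitian matrices can be sketched as follows. This is a plain single-vector Lanczos iteration for a real symmetric operator with no reorthogonalization, so it is only appropriate for short runs; the function name `lanczos`, the breakdown tolerance and the diagonal test matrix with three well-separated leading eigenvalues are assumptions made for illustration.

```python
import numpy as np

def lanczos(matvec, q0, k):
    """k steps of symmetric Lanczos using the three-term recurrence."""
    n = q0.shape[0]
    Q = np.zeros((n, k))
    alpha = np.zeros(k)
    beta = np.zeros(k - 1)
    q = q0 / np.linalg.norm(q0)
    q_prev = np.zeros(n)
    b = 0.0
    for j in range(k):
        Q[:, j] = q
        w = matvec(q) - b * q_prev       # orthogonalize against q_{j-1}
        alpha[j] = q @ w
        w = w - alpha[j] * q             # orthogonalize against q_j
        if j == k - 1:
            break
        b = np.linalg.norm(w)
        if b < 1e-12:                    # invariant subspace found: stop early
            return Q[:, : j + 1], alpha[: j + 1], beta[:j]
        beta[j] = b
        q_prev, q = q, w / b
    return Q, alpha, beta

# Test spectrum: three separated leading eigenvalues, the rest clustered in [0, 1].
rng = np.random.default_rng(3)
n, k = 200, 12
d = np.concatenate([[100.0, 50.0, 25.0], rng.uniform(0.0, 1.0, n - 3)])
Q, alpha, beta = lanczos(lambda x: d * x, np.ones(n), k)

T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)   # tridiagonal projection
theta = np.linalg.eigvalsh(T)                               # Ritz values
print("largest Ritz value:", theta[-1])
```

Each step touches only the two most recent basis vectors, which is the source of the $O(nk)$ orthogonalization cost; the matrix-vector products are counted separately.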

The limited existing work covering Krylov subspaces for low-rank approximation all uses minimal
Krylov subspaces. These subspaces are the least expensive to generate, as they require only $k$
matrix multiplications to generate a $k$-dimensional space. This alternative may be much less
expensive than a converged truncated spectral decomposition, especially when $k$ is large, but the
difference in error between the Frobenius norm-optimal truncated SVD or spectral decomposition
approximation and the Krylov subspace approximation may be non-trivial. One is presented with a
choice between an expensive truncated SVD or spectral decomposition with optimal error and an
inexpensive Krylov subspace with non-trivial error. With some minor increases in compute time, an
approximation with smaller error may be possible.

Sources of error in Krylov subspace projections are error in the eigenvalue estimates, and
estimates that converge to trailing rather than leading eigenvalues. Use of block Krylov subspaces
(motivated by random projection methods) combined with Ritz vector truncation of the Krylov
subspace both decreases the error of leading eigenvalue approximations and excludes approximations
of trailing eigenvalues from the Krylov subspace. Block Krylov subspaces with large start blocks
and short Krylov sequences can produce low-rank approximations with lower error than random
projections at equivalent computational complexity. Short Krylov sequences also lead to decreased
loss of orthogonality, to the point that no reorthogonalization is required. Compute costs are
increased by larger block sizes, but adaptive deflation can reduce them to limit the overall
computational complexity of subspace generation while minimizing the convergence sacrificed by
deflation. Inner power iterations, similar to the power iteration-like refinement used in random
projections, also improve convergence in minimal Krylov subspaces by accelerating convergence of
leading eigenvalues; this is an alternative to an extended Krylov subspace combined with Ritz value
truncation. Finally, for approximation problems that require trailing eigenpairs, shift-and-invert
preconditioning may benefit from multiple shifts over inner iteration preconditioning when the
number of inner iterations is limited.
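A short block Krylov subspace followed by Ritz vector truncation can be sketched in a few lines. This is not the hybrid or GrABL algorithm itself: for simplicity each new block is explicitly orthogonalized against all previous blocks rather than relying on a block three-term recurrence, and the names `block_krylov_basis` and `ritz_truncate`, the block size, the number of blocks and the diagonal test matrix are all assumptions made for illustration.

```python
import numpy as np

def block_krylov_basis(matvec, Omega, s):
    """Orthonormal basis of span{Omega, M Omega, ..., M^(s-1) Omega}."""
    X, _ = np.linalg.qr(Omega)
    blocks = [X]
    for _ in range(s - 1):
        W = matvec(X)
        Qall = np.hstack(blocks)
        for _ in range(2):                     # two block Gram-Schmidt passes
            W = W - Qall @ (Qall.T @ W)
        X, _ = np.linalg.qr(W)
        blocks.append(X)
    return np.hstack(blocks)

def ritz_truncate(matvec, Q, k):
    """Rayleigh-Ritz on span{Q}; keep the k leading Ritz pairs."""
    H = Q.T @ matvec(Q)
    theta, W = np.linalg.eigh((H + H.T) / 2)   # ascending order
    keep = np.argsort(theta)[::-1][:k]         # indices of the k largest
    return theta[keep], Q @ W[:, keep]

# Test spectrum: five separated leading eigenvalues, the rest clustered in [0, 1].
rng = np.random.default_rng(4)
n, b, s, k = 150, 5, 4, 5
d = np.concatenate([[100.0, 90.0, 80.0, 70.0, 60.0], rng.uniform(0.0, 1.0, n - 5)])
matvec = lambda X: d[:, None] * X              # M = diag(d) applied to a block

Q = block_krylov_basis(matvec, rng.standard_normal((n, b)), s)
theta, V = ritz_truncate(matvec, Q, k)
print("Ritz values after truncation:", theta)
```

The basis has dimension $bs = 20$, but only the $k = 5$ leading Ritz vectors are retained, so approximations of trailing eigenvalues are discarded from the final subspace.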

We have presented several new Krylov subspace algorithms specialized for the low-rank approximation
problem in which some computational resource is constrained, be it memory or compute time. These
methods are intended to be alternatives to either the minimal single-vector Krylov subspaces
proposed previously or the random projection methods. Short block Krylov subspaces with large
blocks can yield low-rank approximation errors smaller than those of the random projection method,
with compute times that are at worst equivalent and could be up to 25% shorter. We have developed
an adaptive block Lanczos method that applies random projections to inflate and deflate the Krylov
subspace to best exploit local eigenvalue gaps. We have developed new asymptotic bounds that
characterize the effects of inner power iteration in a single-vector Krylov subspace, for use when
limited storage prevents generation of a large Lanczos basis. We have investigated the use of a
hybrid random projections-preconditioning approach to select better shifts.

Short block Krylov subspaces have particular advantages, especially when sparse matrix-vector
products are inexpensive compared to the dense linear algebra operations used in random projection
or Lanczos methods. This is often the case when the input matrix is structured and fairly sparse.
In these cases, short block Krylov subspaces hold advantages over both the random projection method
and single-vector Krylov subspaces. The hybrid random projection-block Krylov subspace method may
produce smaller approximation errors than random projections at equivalent compute time. The hybrid
method also requires no reorthogonalization, which is required by single-vector Krylov subspace
methods. Additionally, for large-dimensional subspaces, truncation of a non-minimal Krylov subspace
can become the asymptotically dominant compute cost.

When dense matrix operations are the predominant compute costs, reducing block sizes inevitably
leads to reduced compute costs. GrABL provides a greedy method to reduce the block size to a
minimum, subject to providing best convergence of leading eigenvalues. A reduction of block size
also allows for stationary power iteration to be performed only with the initial start block,
thereby reducing the number of sparse matrix-vector products. GrABL presents further compute
savings over the shrink-and-iterate and hybrid methods, as it will provide compute savings over
random projections in all cases. We have observed that when used with a power iteration parameter
p > 1, GrABL produces a smaller low-rank approximation error norm than the random projection
method, and does so with fewer sparse matrix-vector products or dense linear algebra FLOPS when the
subspace dimension is sufficiently large. The compute time advantages will be expanded further when
sparse matrix-vector products are expensive.

Stationary power iteration may be integrated as inner iteration in single-vector Lanczos iteration.
This accelerates convergence of leading eigenvalues and improves the quality of the subspace for
low-rank approximation for many applications. Inner power iteration is useful when memory
constraints prevent storage of non-minimal Krylov bases. Nevertheless, the gap between a minimal
Krylov subspace using p inner power iterations and a non-minimal Krylov subspace that is simply p
times longer may be substantial. Though compute costs to truncate the non-minimal subspace may be
non-trivial, inner iteration should be limited to cases for which sparse matrix-vector products are
cheap and memory is constrained.
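The trade-off between inner power iteration and a longer Krylov sequence can be illustrated with a small experiment. The sketch reads inner power iteration as generating the Krylov subspace with $M^p$ in place of $M$, which is an assumption that may differ in detail from the algorithm analyzed in earlier chapters; it also uses full orthogonalization for clarity, and the names `krylov_basis`, `leading_ritz_value` and `power_op` as well as the test spectrum are hypothetical.

```python
import numpy as np

def krylov_basis(apply_op, q0, k):
    """Orthonormal basis of span{q0, B q0, ..., B^(k-1) q0} for B = apply_op."""
    Q = np.zeros((q0.shape[0], k))
    Q[:, 0] = q0 / np.linalg.norm(q0)
    for j in range(1, k):
        w = apply_op(Q[:, j - 1])
        for _ in range(2):                       # two Gram-Schmidt passes
            w = w - Q[:, :j] @ (Q[:, :j].T @ w)
        Q[:, j] = w / np.linalg.norm(w)
    return Q

def leading_ritz_value(d, Q):
    """Largest Ritz value of M = diag(d) on span{Q}."""
    H = Q.T @ (d[:, None] * Q)
    return np.linalg.eigvalsh((H + H.T) / 2)[-1]

# One leading eigenvalue at 2, the remainder spread evenly over [0, 1].
n, k, p = 300, 4, 3
d = np.concatenate([np.linspace(0.0, 1.0, n - 1), [2.0]])
q0 = np.ones(n)

def power_op(x, times):
    """Apply M = diag(d) to x the given number of times."""
    for _ in range(times):
        x = d * x
    return x

# Minimal subspace, minimal subspace with p inner iterations, and a p-times-longer subspace.
err_minimal = 2.0 - leading_ritz_value(d, krylov_basis(lambda x: power_op(x, 1), q0, k))
err_inner = 2.0 - leading_ritz_value(d, krylov_basis(lambda x: power_op(x, p), q0, k))
err_long = 2.0 - leading_ritz_value(d, krylov_basis(lambda x: power_op(x, 1), q0, k * p))
print("minimal:", err_minimal, " inner iteration:", err_inner, " longer basis:", err_long)
```

Under this reading the inner-iteration subspace stores only $k$ vectors and is contained in the longer subspace of dimension $kp$, so the longer subspace can never give a worse leading Ritz value; the cost of that improvement is storage and truncation of the larger basis.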

Judicious choices of shifts can allow for faster convergence when shift-and-invert preconditioning
is used to solve a trailing eigenproblem, as is required for spectral graph problems. Though faster
convergence can lead to Krylov subspaces that produce lower-error approximations, using shifts from
an initial random projection pass does not appreciably improve approximation over simply using
inner power iteration. The extra compute costs for using multiple shifts must also be considered.
Therefore, inner power iteration is a better choice than multiple shifts for accelerating
convergence of trailing eigenvalues.

The specialized Krylov subspace methods developed here present more choices for low-rank
approximation when a truncated SVD or spectral decomposition is optimal. Rather than being limited
to the expensive truncated matrix factorization or the inexpensive Krylov subspace or random
projections approximation that still has some error, one may apply an acceleration method
introduced here to obtain better convergence at modest compute cost, all while remaining within a
modest memory footprint. The improved stability of short Krylov sequences may allow parallel Krylov
subspace methods to become simpler and eliminate some inter-node communication. Future extensions
of this effort would refine and simplify the bounds on the errors of block Krylov subspaces to lend
further insight into those spectral properties that cause random projections to outperform block
Krylov subspaces. Finally, though these acceleration methods were designed specifically for minimal
Krylov subspaces, integration of acceleration methods into restarted Krylov subspaces may yield
iterative eigenvalue computation routines that converge faster for certain classes of Hermitian
matrices.